System and method for real-time processing, storage, indexing, and delivery of segmented video

ABSTRACT

In some embodiments, a method for capturing live video content comprises capturing video signals in a video capture server connected to a video source; receiving the captured video signals from the video capture server in an initial processing server; processing the captured video signals in the initial processing server to provide at least video files for storage and text files associated with the captured video signals; receiving the associated text files in a topic extraction server in communication with the initial processing server and performing contextual topic extracting and processing to provide additional searchable contextual information; and storing the searchable contextual information and placing this information, along with the original text files in a searchable database archive, the stored information being associated with the stored video files.

CROSS-REFERENCE TO RELATED APPLICATION

This application is a continuation and claims priority to U.S. patent application Ser. No. 14/319,074, filed on Jun. 30, 2014, entitled “System and Method for Real-Time Processing, Storage, Indexing, and Delivery of Segmented Video,” which is a continuation and claims priority to U.S. patent application Ser. No. 13/436,973, filed on Apr. 1, 2012, issued as U.S. Pat. No. 8,769,576 on Jul. 1, 2014, entitled “System and Method for Real-Time Processing, Storage, Indexing, and Delivery of Segmented Video,” which claims priority to U.S. Provisional Application No. 61/470,818, filed on Apr. 1, 2011, entitled “Real-Time Delivery of Segmented Video,” all of which are incorporated herein by reference in their entirety.

TECHNICAL FIELD

The present disclosure relates generally to systems and methods that provide for indexing, storage, and access to video broadcasts.

BACKGROUND

Broadcast television is a constantly changing medium with linear programming schedules. Multiple forms of recording devices exist to satisfy a consumer's need to record selected programming at their own convenience, but these require consumers to know in advance what programming they want to record. Programming that has not been recorded cannot be viewed later.

Broadcast television is localized by satellite, cable, or antenna coverage. Even though content partnership between networks is common, the delivery is still regional.

Internet Protocol television (IPTV) 202 solutions are emerging to deliver content ‘on demand’ by exploiting the internet as a global delivery medium, but the large cost of bandwidth and streaming services for long form content delivery, coupled with licensing costs and restrictions, hampers wide scale distribution.

There is also an infrastructure and development cost to create such a delivery platform. These costs mean that a company must have either large-scale user numbers, or premium content must be introduced to attract this audience and achieve a viable income.

User generated content sites such as YouTube have begun to attract the attention of content producers as a medium for delivery, in particular, time-sensitive content such as news broadcasts. These sites go some way to providing content to the user in a timely manner, but indexing is driven by manually generated program titles, descriptions, tags, and other processes that cause delays. For news information in particular, the absence of video content within a search engine's ‘real-time results’ is an indication of a problem in this process, in particular when the story has already been aired, but a user must wait for someone else to manually add a news story in order for them to watch it later.

Video advertising remains largely rooted in its broadcast television foundations. Advertising is based largely on broad channel or program demographics rather than explicit information about a program's content. On the internet, text-based advertising such as Google Adwords has proven to have much more value with context-sensitive advertising.

While the increasing use of mobile devices delivers an emerging base of consumers, traditional long-play program formats are poorly suited to these users and their devices. Several formats have been defined and deployed for delivery of television streams to mobile devices. These formats, such as Digital Video Broadcasting-Handheld or DVB-H, are focused on replicating the television experience on a mobile device but do not address the more common use cases for mobile devices, which favor short-form content.

SUMMARY

The systems and methods disclosed in this specification capture video content, segment the content in real time, can sort the video content into clips by topic, and can deliver those clips as a customized queue of video items relevant to users according to their interests, as determined by their social graph data and/or their manual interest profile configurations.

The disclosed systems and methods can segment long-form video content into smaller clips based on the topic of the content. This enables the generation of a searchable index of short, highly relevant video results. The created index is not only useful to the user, but also provides advertisers a deeper context against which relevant advertising can be selected.

In addition to enabling ‘directed search’ results for keyword lookups in the index, the disclosed systems and methods enable providing recommendations in the form of a custom video queue or other organization means. These recommendations may be based on users' interests. The catalogue of user interests is built by aggregating user input, social network graph information, and usage feedback.

Video processing may be performed in real time so that live or currently broadcasting video content is indexed as it airs. This enables the System 100 to immediately notify users through push mechanisms when relevant content is available and, where the client supports it, deliver the related video. This real-time processing also means that the content can be delivered to search engine indexes to enable real-time results with rich video content.

The index offers a rich data source for analysis and reputation management applications. As a whole, the combination of user interests, demographic information, and behaviors against sentiment aware content forms a valuable asset for brands and advertisers.

BRIEF DESCRIPTION OF THE DRAWINGS

For a more complete understanding of the present disclosure and its advantages, reference is now made to the following description taken in conjunction with the accompanying drawings, in which like reference numbers indicate like features.

FIG. 1 is an illustration of an example embodiment of the System 100 as a whole;

FIG. 2 is a high level illustration of an example path that incoming data travels from collection to processing;

FIG. 3 is a high level illustration of an example path that data travels from storage to user interface;

FIG. 4 is a detailed illustration of an example embodiment of the Capture Platform 110 and its functionalities;

FIG. 5 is an illustration of an example embodiment and the possible paths that data takes through the functionality blocks of the search platform 120;

FIG. 6 is an illustration of an example of the physical division of Data Stream Chunks;

FIG. 7 is an illustration of timing values as a stream of calculated confidence values;

FIG. 8 is an illustration of the combined factors that the System 100 uses to determine a video chunk's start/end time;

FIG. 9 is an illustration of information that can be gathered from social graphs;

FIG. 10 is an example of a web client;

FIG. 11 is an example of an IPTV client; and

FIG. 12 is an example of a mobile device client.

Although similar reference numbers may be used to refer to similar elements for convenience, it can be appreciated that each of the various example embodiments may be considered to be distinct variations.

The present embodiments will now be described hereinafter with reference to the accompanying drawings, which form a part hereof, and which illustrate example embodiments which may be practiced. As used in the disclosures and the appended claims, the terms “embodiment” and “example embodiment” do not necessarily refer to a single embodiment, although they may, and various example embodiments may be readily combined and interchanged, without departing from the scope or spirit of the present embodiments. Furthermore, the terminology as used herein is for the purpose of describing example embodiments only, and is not intended to be limiting. In this respect, as used herein, the term “in” may include “in” and “on,” and the terms “a,” “an” and “the” may include singular and plural references. Furthermore, as used herein, the term “by” may also mean “from,” depending on the context. Furthermore, as used herein, the term “if” may also mean “when” or “upon,” depending on the context. Furthermore, as used herein, the words “and/or” may refer to and encompass any and all possible combinations of one or more of the associated listed items.

DETAILED DESCRIPTION

Shown in FIG. 1 is a system overview illustrating an exemplary embodiment in which the System 100 is illustrated in the context of a number of architecturally distinct units. This exemplary system 100 will be shown in further detail in the other, more detailed figures and description that follow. Illustrated in FIG. 1 is a Capture Platform 110, which captures source data received from incoming video broadcasts and converts the source data into a standardized format.

Also shown in the figure is the Search Platform 120, which, as will be further described herein, extracts topics from source video metadata and manages user taste graphs, the searchable index, and the Application Programming Interface (API) layer for client interaction.

Further illustrated in FIG. 1 is the Media Storage and Delivery Platform 130. The Media Storage and Delivery Platform 130 is the backend for storage and delivery of video content to Client Devices 301. The Media Storage and Delivery Platform 130 receives the converted incoming video broadcasts and stores them in storage within the platform 130. This platform is further accessible through the search platform 120 to present source data that is found through user searches.

Still further illustrated in FIG. 1 is the Front-End Client 140, which is a Client Device 301 (as shown in FIG. 3) through which users interact with the System 100. Through the Front-End Client 140, users will input searches and receive search results. Users will also be able to establish their customized profiles, searches, and “watch lists,” and perform other types of personalizations.

With further reference to FIG. 1 and FIG. 2, the architectural platforms introduced above are now further described.

General Description of Platforms

Capture Platform 110

The Capture Platform 110 collects incoming source media and converts the content into a standardized format for use by the Search Platform 120 and Storage & Delivery Platform 130. Because television content is localized, as shown in FIG. 2, Capture Servers 201 are deployed to local geographic regions when a data feed is not available for remote collection, as in the case of analog signals, or when the cost of locally processing raw data is less than that of receiving raw data remotely, such as when delivery of raw data to a distant server will cost more in bandwidth than setting up a Capture Platform 110 locally. As shown in FIG. 2, the Capture Server 201 is drawn in the form of a known server implementation of a computing machine that includes a user display, a user input device, and memory for storing computer-readable instructions on a computer-readable medium. Thus, as functionality for the Capture Server 201 is described below, the functionality performed by the Capture Server 201 would be provided and performed by the Capture Server 201 running computer instructions that are stored on the Capture Server's 201 computer-readable medium or computer memory, or it could be based on computer-readable instructions that are stored in an external or other computer memory.

Overview

Still referring to FIGS. 1 and 2, and with further reference to FIG. 4, the Capture Platform 110, which includes the Capture Server 201 illustrated in FIGS. 2 and 4, provides a number of functions:

-   To capture over-the-air data 203 and translate it into an Internet Protocol Television (IPTV) signal 202.
-   To process raw content data.
-   To extract Text Metadata 407 from the streams, for example Line 21 or Digital Video Broadcast (DVB) encoded subtitle data.
-   To capture API-provided content and translate it into a standardized format.
-   To capture additional content metadata, for example, Electronic Program Guide (EPG) data from DVB streams.

Infrastructure

The Capture Platform 110 may be divided between processing elements in cloud infrastructure and regionally, physically deployed Capture Servers. The regionally deployed servers convert analog streams into digital transport streams, and convert multiplexed transport streams into separated video and data. All of this data is then delivered to the Storage Platform.

Video is sent to the Media Storage and Delivery Platform 130 and the EPG and subtitle encoded Transport Streams are delivered to the Cloud portion 204 of the Capture Platform 110 for further processing.

Search Platform 120

Overview

The purpose of the search platform 120 is to process incoming data and requests for information. It does this by processing, and then indexing, incoming information; performing search functions when a user requests a specific video clip; and analyzing trends and user usage in order to fine-tune the System 100 as a whole.

Infrastructure

Indexing

Indexing processes act on two separate sides of the System 100:

-   Topic Extraction & Segmentation 212: The Topic Extraction & Segmentation component 212 is responsible for:
    -   processing inbound content data
    -   applying real world models
    -   dividing the real-time data stream into stories, or topics, to be placed in the searchable index
-   Taste Graph Processing: The Taste Graph Processing component applies real world models to user taste graph information to better match user taste graphs against the index.

Search

The search function includes processes for:

-   Directed Search: Results related to a specific query, such as a query submitted by a user in a search text box.
-   Push: Results are pushed to the user according to their registered interests. Results are also sent in real time to search engines for public indexing.
-   Recommendations:
    -   Custom Queue/Recommendations: results relevant to the user's and/or their friends' taste graph of interests
    -   Related Search: results related to a specific video
    -   Popularity: results based on usage feedback (shares, views, ratings, etc.)
    -   Category: results related to trending keywords, source, locations, or categories
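The Push function described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that a newly indexed topic carries a list of keywords and that each user has a list of registered interest terms; the function name `users_to_notify` and the data shapes are hypothetical and are not taken from the disclosed embodiments.

```python
def users_to_notify(topic_keywords, registered_interests):
    """Return the users whose registered interests overlap a new topic.

    topic_keywords: keywords extracted for a newly indexed topic (assumed shape).
    registered_interests: mapping of user id -> list of interest terms (assumed shape).
    """
    keywords = {k.lower() for k in topic_keywords}
    matched = [
        user
        for user, interests in registered_interests.items()
        # A user matches when at least one interest equals a topic keyword.
        if keywords & {i.lower() for i in interests}
    ]
    # Sorted only to make the result deterministic for the example.
    return sorted(matched)


interests = {
    "ann": ["economy"],
    "bob": ["sports"],
    "cy": ["cnn", "weather"],
}
recipients = users_to_notify(["Economy", "CNN"], interests)
```

An exact, case-insensitive keyword match is the simplest possible rule; an actual embodiment could instead match against taste graph information as described for the Taste Graph Processing component.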

Analysis:

In addition to returning an indexed result to a search query, the System 100 is also engaged in the ongoing analysis of the index. It performs this process by analyzing the following pieces of information:

-   Trends: What is popular overall, or divided by source, time, location, category, etc.
-   Usage Analysis: What users are viewing, sharing, rating or skipping. This feedback is sent to index ranking algorithms.
-   Sentiment Analysis: The ‘sentiment’ of content. This can be processed in the form of an overall report for sentiment, or sub-divided by topic, source, time, location, category, etc.

Media Storage and Delivery Platform 130

Overview

In disclosed embodiments of the Media Storage and Delivery Platform 130, incoming video/audio streams 404 are marked with time data and stored according to source information. The incoming video/audio streams 404 need not be physically divided. Rather, the System 100 can use Start/End data in order to determine when a relevant clip is needed, and can accordingly wrap that chunk of video/audio stream 404 in a single topic wrapper to be sent to the user.

Infrastructure

The Media Storage and Delivery Platform 130 can use cloud-based data storage, such as Amazon's Web Services S3 storage product. Other such cloud-based data storage approaches, such as Microsoft's Windows Azure platform, may be used. Further, a non-third-party or in-house storage system can be used. In any case, in disclosed embodiments content would be pushed to this storage infrastructure using HTTP mechanisms.

Front-End Client 140

Overview

A Front-End Client 140 is ultimately any program that a user uses to interact with the System 100 and retrieve data from it. API interactions with the platform APIs may be provided as a component of the Search platform 120 specification.

Client Devices 301 are defined as any mobile, website, IPTV devices, smart televisions, set-top boxes, gaming platforms, in-car entertainment systems, or any other web capable device. A client could equally be integrated with a content agency or broadcast network's existing platforms or platform families, such as Microsoft's Mediaroom.

Data flows of the system are enabled to take certain paths as the data is being collected, processed, stored, and retrieved. These paths are described in the following section as a series of steps that span multiple platforms and cross multiple components of the System 100.

Data Paths

FIG. 4 illustrates the data path flows of incoming video streams as they are processed by the System 100. Further description of the elements of this system is provided below.

Incoming “Raw” Data to Storage:

Step 1

Incoming data is gathered by the Capture Platform 110. This data can come in the form of at least two distinct types of input, over-the-air data 203 and IPTV data 202. Over-the-air data 203 is captured using physical capture methods such as: collecting satellite signals with a satellite receiver, collecting local channels using antennas, and plugging into a cable network in order to capture cable signals. IPTV 202 data may be streamed through the internet as opposed to more traditional methods of broadcasting television data, and usually includes more information about the content itself.

Step 2

The Capture Platform 110 separates the video/audio stream 404 from Text Metadata 407, assigns a unique source and time code to it, and sends it to the Media Storage and Delivery Platform 130. This unique source and time code is later used by the index to reference the video's topic after the Text Metadata 407 is processed. In the case of IPTV 202 data, the video/audio stream 404 and Text Metadata 407 are already separated, so the Capture Platform 110 simply has to send this data along to the appropriate storage locations.

Step 3

Text Metadata 407 is sent to the search platform 120 where it is processed. This incoming metadata is divided into small stories called topics. These topics contain specific keywords that can be used to index the Text Metadata 407 and to assign context to the video/audio stream 404 that is associated with it.

Video/audio streams 404 pass through the search platform 120 in order to be segmented and assigned more specific source and time codes. Once this is done, the video/audio streams 404 are sent to the Media Storage and Delivery Platform 130.

Step 4

Once information about the Text Metadata 407 has been fully extracted, it is stored in the index under a unique topic that can later be used by a Front-End Client 140 in order to retrieve a specific video clip. Video/audio data is stored in the Media Storage and Delivery Platform 130 in segmented chunks that can easily be retrieved when a request for the specific clip is made.
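As a rough illustration of Steps 3 and 4, the following sketch builds a keyword index in which each keyword points back to the source and time code of the associated video/audio stream. The `Topic` record, its field names, and the use of seconds for time codes are assumptions made for this example only.

```python
from collections import defaultdict
from typing import NamedTuple


class Topic(NamedTuple):
    """Hypothetical record for one topic extracted from Text Metadata."""
    source: str      # source code of the originating stream
    start: int       # start time code, in seconds (assumed unit)
    end: int         # end time code, in seconds (assumed unit)
    keywords: tuple  # keywords found in the topic's text


def build_keyword_index(topics):
    """Map each lower-cased keyword to the (source, start, end) references using it."""
    index = defaultdict(list)
    for topic in topics:
        reference = (topic.source, topic.start, topic.end)
        for keyword in topic.keywords:
            index[keyword.lower()].append(reference)
    return dict(index)


topics = [
    Topic("cnn", 0, 90, ("Economy", "Jobs")),
    Topic("bbc", 30, 60, ("economy",)),
]
keyword_index = build_keyword_index(topics)
```

Because only the source and time code are stored against each keyword, the index stays small and the video itself remains in the Media Storage and Delivery Platform 130, consistent with the separation described above.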

User Query to Data Retrieval

Step 1

A user queries a specific video clip using a Front-End Client 140 device. Or, a user with social graph data logs into a Front-End Client 140 and the System 100 can populate the user's custom video queue with videos relevant to their interests.

Step 2

The Front-End Client 140 sends a request to the Index of TV 215 for the source and time code of the specific video clip that the user queried, or the video clip that the system deems appropriate to display on the custom video queue.

Step 3

The Index sends the source and time code data to the Media Storage and Delivery Platform 130.

Step 4

The Media Storage and Delivery Platform 130 filters through its stored data streams for the specific section of data that the user requested.

Step 5

Upon locating the requested video clip, the System 100 wraps the clip, along with its start/end time data and source information, in a content wrapper and sends the combined files to the Front-End Client 140.
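The five retrieval steps above can be sketched end to end as follows. This is a toy model under stated assumptions: the classes `Index`, `MediaStore`, and `ContentWrapper` are hypothetical names, the stored stream is modeled as one byte per second, and the index is keyed by an exact topic title.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TopicEntry:
    topic: str
    source: str
    start: int  # seconds (assumed unit)
    end: int


@dataclass(frozen=True)
class ContentWrapper:
    """Clip plus the source and start/end data that accompany it (Step 5)."""
    source: str
    start: int
    end: int
    media: bytes


class Index:
    """Stand-in for the index: topic title -> source and time code."""
    def __init__(self):
        self._entries = {}

    def add(self, entry: TopicEntry) -> None:
        self._entries[entry.topic.lower()] = entry

    def lookup(self, query: str) -> Optional[TopicEntry]:
        return self._entries.get(query.lower())


class MediaStore:
    """Stand-in for stored streams; one byte represents one second of video."""
    def __init__(self):
        self._streams = {}

    def put(self, source: str, data: bytes) -> None:
        self._streams[source] = data

    def fetch(self, source: str, start: int, end: int) -> bytes:
        return self._streams[source][start:end]


def retrieve_clip(index: Index, store: MediaStore, query: str) -> Optional[ContentWrapper]:
    entry = index.lookup(query)            # Steps 2-3: resolve source and time code
    if entry is None:
        return None
    media = store.fetch(entry.source, entry.start, entry.end)  # Step 4
    return ContentWrapper(entry.source, entry.start, entry.end, media)  # Step 5


index = Index()
index.add(TopicEntry("Economy", "cnn", 2, 5))
store = MediaStore()
store.put("cnn", b"abcdefgh")
clip = retrieve_clip(index, store, "economy")
```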

Back End Process

The back-end process and system elements are discussed below in further detail with reference to FIG. 2.

Capture Servers 201

Incoming data is gathered by Capture Servers 201. This data comes in the form of two distinct types of input:

Over-the-Air Data 203:

Over-the-air data 203 is captured using physical capture methods such as: collecting satellite signals with a satellite receiver, collecting local channels using antennas, and plugging into a cable network in order to capture cable signals. Because these physical capture methods are regional, Capture Servers 201 may be placed in a specific region to gather local TV signals within that region. For example, if a user living in Liverpool UK is taking a trip to the United States and wishes to know what is currently happening in that country, a regional Capture Server 201 would mean that television streams from the United States could be made available to that user.

IPTV Data 202

Internet Protocol data is in some respects easier to collect. Because it is streamed through the internet as opposed to more traditional methods of broadcasting television data, it usually includes more information about the content itself. IP data is available wherever an internet connection is available, and because the internet is almost universally accessible, regionally deployed physical servers aren't necessarily required. Rather, all that is needed is to connect a Capture Server 201 to the internet and deliver the incoming content to the Initial Processing Server 205.

This data is preprocessed in the Capture Servers 201 according to type. Over-the-air data 203 usually comes as a massive data stream, and oftentimes needs to be demultiplexed (broken up into individual channel streams and Text Metadata 407) in order to be transported to the initial processing step. In certain described embodiments herein, preprocessing is not necessary. The IPTV data 202, for example, generally comes pre-split and ready for initial processing. The Capture Server 201 may further be operable to place the captured video streams into a regular format that can be recognized and processed by the Initial Processing Server 205 below. The formatted video streams may then be sent from the Capture Servers 201 to the Initial Processing Server 205.

Initial Processing Server 205

Once data is captured and converted into a format that can be more easily transported, the Capture Server 201 sends this data through the internet to the Initial Processing Server 205. Initial processing is done primarily to separate the incoming preprocessed data and sort it into its corresponding processing and storage divisions. As FIG. 2 illustrates, once the Initial Processing Server 205 has completed sorting, it sends the sorted data to the Text Decoding Server 208, the Video Processing Server 207, and the Image Server 209.

Video Processing Server 207

Video and Audio Streams are sent to the Video Processing Server 207 from the Initial Processing Server 205, where they are divided up and a unique time stamp and source code is assigned to the incoming data. The video stream may not actually be physically divided up, but rather a time stamp may be assigned to every second or so of video. Video may, for example, be stored in hour-long chunks, but this is simply for storage purposes and has nothing to do with the length of a video viewed by a user. When a user makes a query for a specific video clip, which could be of any size, the System 100 does not need to reassemble chunks of divided video. Instead, it only has to look up the start and end time of the video the user requested, find the source of that same video, and extract that specific portion of video from the stored stream. If a user requests a clip that spans multiple hour-long segments, the System 100 has the capability to retrieve the separate pieces and reassemble them before delivering the requested video to the user. Once the streams have been assigned source and start/end time data, they are delivered to Video Storage 206.
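The lookup described above, including the case of a clip that spans more than one hour-long storage chunk, can be sketched as simple arithmetic on time stamps. The sketch assumes time stamps are whole seconds counted from the start of a source's stored stream and that chunks are exactly one hour long; the names `ChunkSlice` and `locate_clip` are hypothetical.

```python
from dataclasses import dataclass

CHUNK_SECONDS = 3600  # assumed storage chunk length: one hour


@dataclass(frozen=True)
class ChunkSlice:
    source: str        # source code of the stored stream
    chunk_index: int   # which hour-long stored chunk holds this piece
    offset_start: int  # seconds from the start of that chunk
    offset_end: int    # seconds from the start of that chunk (exclusive)


def locate_clip(source: str, start: int, end: int) -> list:
    """Return the stored-chunk pieces that together cover [start, end)."""
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end")
    pieces = []
    cursor = start
    while cursor < end:
        index = cursor // CHUNK_SECONDS
        chunk_start = index * CHUNK_SECONDS
        # A piece stops at the clip end or the chunk boundary, whichever is first.
        piece_end = min(end, chunk_start + CHUNK_SECONDS)
        pieces.append(
            ChunkSlice(source, index, cursor - chunk_start, piece_end - chunk_start)
        )
        cursor = piece_end
    return pieces


# A 200-second clip that straddles the boundary between the first two chunks.
pieces = locate_clip("cnn", 3500, 3700)
```

In this example the clip resolves to the last 100 seconds of chunk 0 and the first 100 seconds of chunk 1, which a delivery step would then concatenate in order.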

Text Decoding Server 208

As illustrated in FIG. 2, the Text Decoding Server 208 receives incoming text data from the Initial Processing Server 205 and extracts individual words from the incoming Text Metadata 407. The Text Decoding Server 208 does this through the use of Voice Recognition and OCR, or Optical Character Recognition, applied to Subtitle data. In this way, analog information can be converted into a digital format that algorithms can be applied to. This process is explained in further detail later in this document.

Image Server 209

As pictured in FIG. 2, the Image Server 209 receives images from the Initial Processing Server 205, and divides them up into individual thumbnail components. The purpose of the Image Server 209 component is to take individual frames from the video stream and use them as thumbnails. These thumbnails are used in Front-End Clients 140 in order to give the user a preview of any given clip. Images taken by the Image Server 209 are stored in the Image Storage 210.

Image Storage 210

As shown in FIG. 2, once the Image Server 209 has finished extracting thumbnail images from a video clip, these thumbnails are sent to the Image Storage 210 to be stored. When a user opens a Front-End Client 140, a series of thumbnails is displayed as representative of the video/audio clips they represent.

EPG and Neilson Data 211

As FIG. 2 illustrates, additional data from a video clip's EPG and Neilson Data 211 can be used to add more context to a clip to make it more easily categorized.

Topic Extraction & Segmentation 212

As FIG. 2 further illustrates, once text subtitle and OCR data has been processed in the Text Decoding Server 208, it is sent to Topic Extraction and Segmentation 212. This is where the bulk of processing is performed using semantic and grammar processing, key term searches, input from EPG and Neilson Data 211, information from the Contextual Database 213, and other sources of information that can provide Topic Extraction 212 with information as to the contents of the text data. Once Topic Extraction 212 has collected enough information about a collection of Text Metadata 407 to assign the data a Topic, the Text Metadata 407 is delivered to the Database Archive 214, the Index of TV 215, and the Trending Metrics 216 component. This process is further described later in this document.

Contextual Database 213

The Contextual Database 213 is composed of a number of dictionaries that aid in applying a topic to a given video clip based on the translated Text Metadata 407. As FIG. 2 displays, Topic Extraction 212 makes use of these contexts when assigning topics to video clips.

Database Archive 214

Once topics have been assigned to the processed text data in the Topic Extraction & Segmentation component 212, the text data is delivered to the Database Archive 214.

The Database Archive 214 substantially serves as storage for all topic information. This archive is described in the present embodiment as a static index, such that once information is sent to the Database Archive 214, it does not change. In this embodiment, this ensures that all information that the System 100 processes has a solid backup copy.

If an indexed topic is categorized incorrectly, the Database Archive 214 provides a means for correcting these problems if the topic is too wildly incorrect to be easily fixed. Because the System 100 is dealing with imprecise information in the form of TV content, it is advantageous to provide a method and means for making such corrections.

Also, if a user needs a particular clip from a particular time on a particular channel, the Index of TV 215 is not well-suited for searching for information according to these parameters. In the instance that this data does need to be found, the Database Archive 214 is better equipped to handle requests of this nature.

Index of TV 215

After text data has been assigned topics in the Topic Extraction 212 component, it is delivered to the Index of TV 215.

The Index of TV 215 is a dynamic index that continually changes and updates itself as more information is received. This component of the System 100 contains all current updates of all topics, and is referenced by the Front-End Client 140 of the System 100 when looking for user requested content. When user requests are combined in the Custom TV 306, they pass through the Index of TV 215 on their way out to the Front-End Client 140.

Front-End Processing

FIG. 3 represents the path data takes when a user inputs information into a Client Device 301.

Client Devices 301

Client Devices 301 represent the basis of user interfaces. A client device is defined as anything that an average user can use to connect to the System 100. The basic form of a client device is a website with a Graphical User Interface, or GUI. However, this is by no means the only client device the System 100 could connect with. Anything with an internet connection such as gaming systems, mobile phones, tablets, laptops and vehicle media systems could easily have an application adapted to access the System 100 and retrieve videos for users.

There are two distinct processes through which a Front-End Client 140 requests and receives video content to display to users. They are user-specific search queries and system recommendations.

User-specific search queries are things users specifically request, such as a specific clip, “President Obama's inauguration speech” or a parameter of things the user wants to see, such as “President Obama, the economy, CNN.” The latter example may return several results. As far as the Front-End Client 140 is concerned, user-specific searches are begun when a user types a list of parameters or a specific clip title into a search box and clicks send.

System recommendations are more complex than user-specific searches, and pull information from more places than a simple direct search. A system recommendation is activated when a user logs into a client with a pre-established account. When the GUI loads, part of it can include a “recommendations” box. This box might display, for instance, several thumbnails of videos that the System 100 has reasoned the user might like based on past searches, information gathered from social graphs, behavior during different clips (such as pausing or skipping certain clips) and other factors that the System 100 can use to aggregate a user's interests.

It is also worth noting that whenever a user makes a request or clicks on a recommended video clip, the Client Device 301 reports to the Web Server 302 that a video has received some sort of feedback. The Web Server 302 then sends this information to the Metrics and Trends Processing component 216 for further processing.

Webserver

The webserver 302 may act as a gateway to and from the System 100. It acts in this regard to direct requests from Client Devices 301 to the system components, and to push requested content from the Index of TV 215 back to the Client Devices 301.

Metrics and Trends Processing 216

The Metrics and Trends Processing component 216 is where information about a specific clip can be found. This information isn't necessarily what a clip is about. Rather, it is information about how popular a clip is, how users behave while watching it (pausing, skipping, etc.), and also how frequently a clip is searched for, ignored, or viewed.

The purpose of compiling all this information is to place higher quality, more universally liked clips at the “top” of user access and to place damaged, bad quality, or irrelevant clips closer to the “bottom” so that users rarely, if ever, have to view a bad clip.

Metrics and Trends Archive 217

As FIG. 2 illustrates, once the Metrics and Trends Processing component216 has finished extracting information, the data is sent to the Metricsand Trends Archive 217, where it can later be accessed.

Social Media 303

Social Media 303, such as Facebook, Twitter, LinkedIn, and other Social Media 303 websites, are a valuable source of data when determining a specific user's interests. Because these sites actively gather information about not only a single user, but also the network of other users with whom that user is associated, a massive quantity of information can be gathered about what the user likes, and also what the user is likely to become interested in based on their friends' and connections' interests.

When a user first sets up an account with a Front-End Client 140, they are given the option to connect their Social Media 303 to the client; it is at this point that the System 100 sends a request for information to these Social Media 303 outlets. The System 100 can also update itself when a change is made to one of these Social Media 303 outlets. These Social Media 303 outlets respond to the request with a Social Graph from which the Social Graph Processing 304 component can extract information.

Social Graph Processing 304

When the System 100 receives a social graph, it sends the graph to the Social Graph Processing component 304 in order to extract a user's interests and arrange them in a way that trending topics can be related to.

The System 100 can also make use of social graph information to add context when a user makes a search query. For example, if a user searches for “Apple,” that search query may turn up results involving Apple the company or apple the fruit. However, if that user has mentioned Apple the company or likes Apple the company in one or more Social Media 303 outlets, the System 100 can infer that this user is requesting video clips about Apple the company.
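By way of illustration only, the disambiguation described above can be sketched as a re-ranking step. The `SearchResult` type, the `entity` labels, and the `boost` factor are hypothetical names and values chosen for this sketch and are not part of the disclosure.

```python
from dataclasses import dataclass


@dataclass
class SearchResult:
    title: str
    entity: str   # hypothetical label for the sense of the keyword
    score: float  # base relevance as returned by the search index


def rerank_with_interests(results, interests, boost=2.0):
    """Boost results whose entity appears among the user's social-graph interests."""
    def adjusted(result):
        return result.score * boost if result.entity in interests else result.score
    return sorted(results, key=adjusted, reverse=True)


results = [
    SearchResult("Apple harvest season begins", "apple (fruit)", 0.9),
    SearchResult("Apple announces new phone", "Apple Inc.", 0.8),
]
# A user whose social graph mentions Apple the company.
ranked = rerank_with_interests(results, interests={"Apple Inc."})
```

With no matching interests the base index order is preserved, so the social graph only adds context and never removes results.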

Graph Database 305

Once the Social Graph Processing component 304 has extracted the necessary information from a given social graph, it deposits the graph in the Graph Database 305. The Graph Database 305 is a general storage space used to house processed graphs in a form that the System 100 can use to compile custom video line-ups for individual users.

Custom TV 306

The Custom TV 306 component assembles a group of videos that the System 100 assumes a user will be interested in. It first collects social graph data from the Graph Database 305 in order to determine a user's interests. It then requests relevant video clips from the Index of TV 215 and sends them to the Web Server 302, which delivers the recommended videos to the user's Client Devices 301.

Detailed Capture Platform 110 Description

Platform Inputs

The Capture Platform 110 can consume data from any of a number of sources:

Over-the-Air Content 203:

Over-the-air content 203, such as Satellite, Cable, or Antenna delivered broadcast content, can be captured and converted into a digital IPTV stream for processing using video capture hardware 401. This content can have several channels multiplexed together.

IPTV Streams 202:

IPTV streams 202 can be captured and processed. Because these streams come directly over the internet, no physical capture device, such as a satellite dish or a cable box, is necessary to collect them. Like their analog equivalents, these streams can contain a number of channels multiplexed together. The capture system demultiplexes the signal into individual Transport Streams, including:

Video/Audio Streams 404:

Which are then re-combined in MPEG video file containers.

Subtitle Data:

Which is usually in DVB or Line 21 format (EIA-608, CEA-708 and ETS 30074.) The DVB format in particular poses a problem, because it is image based and therefore requires an additional step of Optical Character Recognition (OCR) processing to decode text data from the subtitle images. These images are pictures of the sentences that are displayed on screen as subtitles, commonly white text on a black background.

EPG Data:

Captured IPTV 202 streams can contain encoded EPG data, which is decoded for delivery alongside the subtitle metadata. When EPG data is not included in the initial bundle of multiplexed data, it can be collected from commercially available internet distributed feeds.

Additional IP Information:

Content Providers expose a variety of proprietary APIs for content delivery; these are generally HTTP and XML protocol driven APIs.

Content APIs 408:

Content Providers expose a variety of proprietary APIs for content delivery, generally HTTP and XML protocol driven. Rather than physical servers, the Capture Platform 110 servers for these APIs are deployed to Cloud 204 infrastructure. Content from these sources is preprocessed, so when it is delivered to the Capture Platform 110, the only processing necessary is to split the information into Text Metadata 407 and video/audio 404 streams, which are delivered directly to the Search Platform 120 and the Media Storage and Delivery Platform 130, respectively.

Platform Functionality

The Capture System 110 is broken into a number of system elements, based on location and need. The first of these system elements is the Capture Server 201, which is operable to collect raw data, preprocess it, demultiplex it, and stream it into the System 100. The second of these system elements may exist in a cloud infrastructure 204. Its purpose is to begin the process of splitting data up by extracting Text Metadata 407, applying a format the Search Platform 120 can understand, and delivering it to the Search Platform 120 for the Topic Extraction & Segmentation component 212. The cloud infrastructure component 204 of the Capture Platform 110 may also receive Content API data 408, split it into Video/Audio Streams 404 and Text Metadata 407, and deliver each to its respective platform.

Platform Outputs

After processing, the Capture Servers 201 output:

Video:

-   -   Video is sent to the Media Storage and Delivery Platform 130.

Text Metadata 407:

-   -   Text Metadata 407 is sent to the Search platform 120. Metadata is sent as a JavaScript Object Notation (JSON) formatted object, but any extensible data format can be used (XML, BSON, etc.) This data can include:
        -   Subtitle Text
        -   API Provided Text Content (articles or transcripts)
        -   Time data, of particular importance for subtitle content.
        -   Program/Item Title & Description
        -   Rating Information: Adult/Violent/etc. provided by the EPG program data, time of day (watershed in the UK and Safe Harbor in the US for example), API supplied field, or defined by source (e.g. children's programming channel.)
        -   Additional Category Information: Channel Source, EPG, or API content may each provide category information such as News, Music, Entertainment, etc.
        -   Confidence: where voice recognition is used, result confidence value
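As a minimal sketch of what one such JSON formatted Text Metadata 407 object might look like, the following assembles the fields listed above. The key names (`source_id`, `time`, `subtitle_text`, and so on) and the sample values are assumptions made for illustration; the disclosure does not define a schema.

```python
import json


def build_text_metadata(source_id, time, subtitle_text, title,
                        description, rating, categories, confidence=None):
    """Assemble one Text Metadata object; all key names are hypothetical."""
    metadata = {
        "source_id": source_id,
        "time": time,
        "subtitle_text": subtitle_text,
        "title": title,
        "description": description,
        "rating": rating,
        "categories": categories,
    }
    # Confidence only applies where voice recognition produced the text.
    if confidence is not None:
        metadata["confidence"] = confidence
    return json.dumps(metadata)


payload = build_text_metadata(
    source_id="channel-x",
    time="1:01:33",
    subtitle_text="We report from the Libyan capital.",
    title="Evening News",
    description="Hypothetical program description",
    rating="general",
    categories=["News"],
)
```

Because the payload is plain JSON, the same structure could be serialized to XML or BSON without changing the fields themselves.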

Detailed Search Platform 120 Description

FIG. 5 illustrates a block diagram for the search platform 120, including platform inputs, functionality, and outputs as will be described below.

Platform Inputs

The Search Platform 120 receives incoming Text Metadata 407 provided by the Capture Platform 110.

Platform Functionality

The Search Platform 120 is divided into the following functional blocks:

Topic Extraction & Segmentation Block 212

At the Topic Extraction & Segmentation Block 212, the incoming data stream is divided into Topics, small story segments extracted from a continuous source. These Topics contain not only specific keywords, but also the related ontologies, sentiment and confidence data.

Video Segmentation Block 501

At the Video Segmentation Block 501, the continuous video stream is divided into smaller, more easily managed chunks of data.

Trending Keywords Block 502:

In the Trending Keywords Block 502, topics are aggregated according to source, time and categories.

Real-Time Updates Block 503:

The Real-Time Updates Block 503 delivers results in real time to Client Devices 301. These results are produced by combining data as it is received in real time with users' taste graph information.

This Block 503 delivers these results according to users' personalized profiles. It is operable to receive topics and match them to registered user profiles so that the topics can be delivered to the users' Client Devices 301 as, for example, “Breaking News” alerts or “Now on Channel X: . . . ”.

This component manages the pairing of topics to users and, in addition, the users' volume settings, which define how often they expect to be notified of events on their Client Devices 301.

Index Block 504

The System 100 defines multiple indexes with multiple purposes; however, the Index Block 504 within the Search Platform 120's functionality is primarily responsible for tracking information related to the content and topic information concerning specific chunks of video data. Once Topic Extraction 212 has completed extracting the topic from a given set of text data, it delivers that same text data to the index to be filed. The Index Block 504 is also responsible for housing user taste information once it has been extracted from user behavior. As FIG. 5 also displays, the index is also responsible for returning information to the Directed Search functionality block when a user makes a direct search request.

Taste Graph Block 505:

The Taste Graph Block 505 handles the categorization and management of user taste graph data, so that it may later be applied to the index to produce relevant video results. This taste graph is a combination of information gathered from Social Media 303 and information gathered from the Feedback Block 507.

Process

By combining users' taste graphs and received topics, a set of suitable users and relevance confidence weights is produced.

Each user has a ‘volume’ setting for their registered Client Devices 301 that specifies the frequency with which notifications appear. This setting can ‘learn’ user behavior, according to whether notifications are viewed or dismissed.

On the basis of the volume and relevance confidence, the System 100 can make one of three decisions:

-   -   Match: Send an update to the user
    -   Wait: The topic is a good match, but the user has recently been sent an alert; or
    -   Block: The user volume and match confidence weights are low and so the push should not be sent. This can be detected in the initial user matching.

To prevent never-ending processing of inactive users, an overriding setting automatically tapers out alerts (exponential backoff) as time passes without user activity.
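The three-way decision and the inactivity taper described above can be sketched as follows. This is a minimal illustration under stated assumptions: the multiplicative combination of relevance and volume, the `threshold`, the `min_gap_minutes` spacing, and the seven-day half-life are all hypothetical values not specified in the disclosure.

```python
from enum import Enum


class Decision(Enum):
    MATCH = "match"
    WAIT = "wait"
    BLOCK = "block"


def effective_volume(volume, days_inactive, half_life_days=7.0):
    """Taper the user's volume exponentially as inactivity grows (assumed half-life)."""
    return volume * 0.5 ** (days_inactive / half_life_days)


def decide(relevance, volume, minutes_since_last_alert, days_inactive=0.0,
           threshold=0.25, min_gap_minutes=30.0):
    """Return MATCH, WAIT, or BLOCK for one topic and one user."""
    score = relevance * effective_volume(volume, days_inactive)
    if score < threshold:
        return Decision.BLOCK   # volume and match confidence too low to push
    if minutes_since_last_alert < min_gap_minutes:
        return Decision.WAIT    # good match, but the user was alerted recently
    return Decision.MATCH       # send the update now
```

For example, `decide(0.9, 0.8, minutes_since_last_alert=120)` yields a match, while the same topic for a user who has been inactive for 28 days is blocked because the tapered volume falls below the threshold.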

External Search engines

The PubSubHubbub protocol is used to publish real-time results to search engines for indexing (e.g. Google, etc.)

Directed Search block 506:

Directed searches include query processing for directed search, and the necessary interaction with the search index to produce a search result.

Custom Queue Block 508:

The queue delivers custom result sets generated by combining user taste graph data with the search index. The queue lets client implementations choose among different options for ordering a result list:

-   -   Time: Prioritize recent results (strictly or loosely)
    -   Relevancy: Prioritize the most relevant results
    -   Context Spread: With each result set, include a mixture of contexts, for example: some news, entertainment, sport, etc.
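The three ordering options above might be sketched as below. The `Clip` type and its fields are hypothetical, and the round-robin interleaving used for Context Spread is only one possible way to mix contexts; the disclosure does not prescribe a particular algorithm.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass
class Clip:
    title: str
    category: str
    relevance: float
    timestamp: float  # seconds since epoch


def order_by_time(clips):
    """Most recent results first."""
    return sorted(clips, key=lambda c: c.timestamp, reverse=True)


def order_by_relevancy(clips):
    """Most relevant results first."""
    return sorted(clips, key=lambda c: c.relevance, reverse=True)


def order_by_context_spread(clips):
    """Round-robin across categories, most relevant first within each category."""
    buckets = defaultdict(deque)
    for clip in order_by_relevancy(clips):
        buckets[clip.category].append(clip)
    ordered = []
    while buckets:
        for category in list(buckets):  # copy keys so emptied buckets can be removed
            ordered.append(buckets[category].popleft())
            if not buckets[category]:
                del buckets[category]
    return ordered
```

A client would select one of these functions according to the ordering option it requests from the queue.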

Relevance is determined by topic and interest confidence matching and is driven by the native relevancy mechanisms of Lucene or an alternative search engine.

Feedback Block 507:

Feedback includes the processing of explicit and implicit user behavior data to manage the index relevance rankings and user graph data.

Usage feedback forms are an important aspect of managing the index and tuning the relevancy algorithms within the System 100.

Feedback can be categorized as:

-   -   Explicit: When a user shares a clip or rates a clip positively or negatively.
    -   Implicit:
        -   When a user selects the clip from a list.
        -   When a user presses skip/stop before completion, or any other user behavior that may give feedback as to a user's feelings towards a clip.

This feedback serves many purposes throughout the System 100, including:

-   -   Search Ranking: Rated and Shared clips are boosted within the index. Negatively rated clips, or those skipped or stopped prematurely, are diminished within the index.
    -   Segmentation: Premature skip or stop actions at the end of a video may indicate a clip needs to be shortened.
    -   Directed Search: Keywords entered as direct search queries serve to boost their related keywords in trending topic results.
    -   User Specific: Client applications, such as mobile apps or websites, allow users to rate items.

Frequent negative ratings indicate a problem with the active video item, either in terms of content quality or relevance. Negative ratings are applied to the index as a whole, in order to suppress poor quality content from the core index's base.

Additionally, the taste graph of the user who negatively rated the item is updated because a negative rating towards content within a certain topic context may indicate that the user is less interested in the topic. Positive ratings boost the overall index rating for individual topic items and, within that user's taste graph, the assumed interest in the associated topic.
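As a rough sketch of how a single rating might update both the shared index and the rating user's taste graph, consider the following. The dictionary-based stores, the fixed `step`, the neutral starting interest of 0.5, and the clamping to the range 0 to 1 are assumptions for illustration only.

```python
def apply_rating(index_scores, taste_graph, clip_id, topic, positive, step=0.1):
    """Apply one explicit rating to the global index and to the user's taste graph."""
    delta = step if positive else -step
    # Index-wide effect: boost or diminish the clip for all users.
    index_scores[clip_id] = index_scores.get(clip_id, 0.0) + delta
    # User-specific effect: adjust assumed interest in the clip's topic,
    # starting from a neutral 0.5 and clamped to the range [0, 1].
    current = taste_graph.get(topic, 0.5)
    taste_graph[topic] = min(1.0, max(0.0, current + delta))


index_scores = {}
taste_graph = {"politics": 0.5}
apply_rating(index_scores, taste_graph, "clip-1", "politics", positive=False)
```

After the negative rating in this example, both the clip's index score and the user's interest in the topic are reduced by one step.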

Directed Search Block 506

Directed search is the result of specific user input into a client's search function.

Reporting

Reporting is required for user behavior analysis and overall product performance. This includes the number of active users, video items consumed, and other indicators as a high level product overview. Deeper reporting may also be provided in parallel with the Feedback analysis to indicate algorithm performance.

Advertising

Along with paid subscription agreements, advertising can be a basis for revenue generation. Where advertising is used, the topic and related ontologies of individual items can be leveraged to provide more context relevant advertising results. Advertising models also require additional reporting, to define performance and potentially to execute revenue sharing agreements.

Because the System 100 recognizes the words within the program and the result is in response to a user's directly or indirectly indicated interests, we can provide highly relevant advertising around the content and therefore highly specific advertising for individual users. For example, where traditional advertising might recognize that a certain demographic watches Piers Morgan on CNN, the disclosed System 100 can recognize that Piers Morgan is on CNN, talking with or about Lady Gaga, a celebrity pop musician, who in turn is talking about her new shoes in New York. While the applications of this context awareness in delivering content have been described, this context awareness can also be combined with advertising inventory to provide highly relevant and targeted advertising results.

Because advertising is based on the topic context, some clips will have a higher monetary value than others. These high value clips can be exploited within the algorithms to play advertising more often than those clips with less contextual value, although care is taken so that the introduction of advertising is not damaging to the user experience and does not cause an unnecessary increase in negative feedback.

To gather more direct user feedback, client implementations include functionality to prompt the user for specific reasons when negative feedback occurs. These reasons may include asking whether the rejection was related to relevance, video quality, timing (segmentation) issues, etc. This provides a tool during testing and ongoing usage as to an alternative option to client advertising and subscription fees.

Monitoring

Monitoring services are required to track the status of the platform as a whole, the throughput/performance of its subsystems, and any errors that may be occurring. Within the cloud environment 204, this monitoring can be used to manage the number of available server ‘instances’ and automatically scale capacity to meet user demand.

Platform Outputs

After processing, the engine is left with the topic information and the user to send it to. The details of specific client APIs for real-time messaging can include:

-   -   XMPP Support for Web Clients
    -   Apple iOS Push Notifications for iPhone OS4 and later.
    -   MQTT protocol for Android/RIM and mobile devices without platform enabled push support

The Search Platform 120 outputs information about a specific clip that users have requested. This information is not the clip itself; instead it is used by the Media Storage and Delivery Platform 130 in order to call up specific clips requested by users.

Detailed Media Storage and Delivery Platform 130

Video content is stored in a scalable data store, with streaming servers to support delivery to clients.

Platform Inputs

The Media Storage and Delivery Platform 130 receives Video/Audio streams 404 from the Capture Platform 110 and stores them with a unique time stamp and source data. It is this time stamp and source data that the platform references when a request is processed.

Platform Functionality

Storage

Cloud-based storage 204 provides advantages by reducing initial infrastructure costs and by removing ongoing infrastructure management costs, while at the same time providing the benefits of local and geographic redundancy. In addition, cloud storage providers include a set of supporting services (e.g. Content Distribution Network or “CDN” delivery).

Video files from live content may, for example, be divided into 1-hour-long blocks. This 1-hour duration is purely a mechanism to divide the content and make management easier; the actual length of content stored will vary and can be adjusted within the Storage Platform.
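One way to picture the block division is as a mapping from a source timestamp to the block that contains it and the offset within that block. The sketch below assumes blocks aligned to the Unix epoch in UTC; that alignment, and the function name `locate_in_block`, are illustrative assumptions rather than details taken from the disclosure.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def locate_in_block(moment, block_length=timedelta(hours=1)):
    """Return (block_start, offset_within_block) for a timezone-aware moment."""
    block_index = (moment - EPOCH) // block_length  # whole blocks since the epoch
    block_start = EPOCH + block_index * block_length
    return block_start, moment - block_start


moment = datetime(2011, 4, 1, 13, 1, 33, tzinfo=timezone.utc)
block_start, offset = locate_in_block(moment)
```

Because `block_length` is a parameter, the same lookup continues to work if the stored block duration is adjusted.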

Thumbnails

The storage server is also responsible for storage and delivery of video thumbnails. These are generated as the stream is delivered through the Capture Platform 110 and stored as static files for delivery via CDN. These thumbnails are organized in a folder hierarchy incorporating the source and date information.
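A folder hierarchy of this kind could, for example, be derived from the source identifier and the capture time. The specific layout shown here (root, source, year/month/day, then a time-based file name) is a hypothetical choice for illustration; the disclosure states only that source and date information are incorporated.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def thumbnail_path(source_id, captured_at, root="thumbnails"):
    """Build a static-file path from the source and the capture date and time."""
    return PurePosixPath(
        root,
        source_id,
        f"{captured_at:%Y/%m/%d}",       # date portion of the hierarchy
        f"{captured_at:%H%M%S}.jpg",     # one file per captured thumbnail
    )


path = thumbnail_path("channel-x", datetime(2011, 4, 1, 13, 1, 33, tzinfo=timezone.utc))
```

Paths built this way are stable and predictable, which suits static delivery via CDN.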

Delivery

As discussed in the segmentation process, topics generate segmented lists of videos only if absolutely necessary. The majority of the time the delivery server extracts stream segments and wraps them in file wrappers on demand. Where on demand transcoding is not possible, baseline profile files are created, for example in different output dimensions.

Digital Rights Management (DRM)

DRM can be applied using licensed third party plugins to the delivery platform.

Platform Outputs

The Media Storage and Delivery Platform 130 outputs video segments to the Front-End Client 140 to be viewed by users.

Front-End Client 140

Platform Inputs

The Front-End Client 140 receives input from two specific sources.

User Inputs

When a user inputs a specific search into a search field within the client itself.

Taste Graphs:

The System 100 uses taste graph information to generate custom video recommendations to users. To simplify generation of a user's taste graph, the clients exploit social APIs.

Merely by submitting taste graph information, the System 100 can build a picture of global user interests and popular terms. This information serves as a basis for relevancy calculations.

Because the System 100 algorithms feed from such a broad set of inputs, traditional Information Retrieval test strategies are employed to build confidence in various algorithms.

-   -   Facebook Open Graph: Users may use Facebook Graph API functionality to connect their account to their Facebook profile, where a rich dataset of tastes and interests can be retrieved.
    -   LinkedIn: Similarly, LinkedIn offers Open Standard for Authorization APIs through which a graph of connections, in particular related to organizations, can be retrieved.
    -   Twitter: Twitter also uses an Open Standard for Authorization derived API to retrieve a user's Twitter graph of interests.
    -   Others: Any other API that can produce an interest graph for a user can be used by the client device(s), so long as the interests are delivered in a standard format to the Search platform 120.

Users are able to generate, augment, and modify their compiled taste graph. However, this should be managed using the server generated compiled and processed view of their profile data, rather than the original social graph data, enabling the platform to reimport this source data and apply the same corrections.

Infrastructure

The infrastructure of this platform is deliberately left largely unspecified. In general, a Front-End Client 140 can be defined as anything that has access to the System 100, so creating a specific description may limit the scope of the System 100's capability at a later date.

Feedback

As a user interacts with a client device, usage feedback is delivered to the Search platform 120 services as an aid to tuning search and indexing algorithms. This feedback is described as part of the Search platform 120 outline.

Front-End Client 140

API interactions with the platform APIs are a component of the Search platform 120 specification.

Client Devices 301 are defined as any mobile device, website, IPTV 202 device, smart television, set-top box, gaming platform, in-car entertainment system, or other web capable device.

Video List

The majority of client user interfaces include a list of videos, generally displayed as a thumbnail, and Text Metadata 407 that includes program title, topic, video source, and duration.

Video Lists are generated according to:

-   -   Custom Video: Based on the user's Taste Graph
    -   Trending Topic: Based on trends
    -   Directed Search: Results for keyword lookups
    -   Popular: Most Shared, Most Viewed, etc.
    -   Category/Topic Based: e.g. related to News, Politics, etc.

Trending Keywords

As described in the platform description, trending topics may be represented on Client Devices 301.

Real-time Updates

Push mechanisms are used on various devices to support immediate alerts when relevant content appears. The mechanisms to support this are described in the Media Storage and Delivery section.

Note: there is a client use case scenario where users are alerted that content is currently being broadcast, without actually delivering the video (“Lady Gaga is now on Channel X!”).

Wi-Fi Prediction

For mobile devices, device specific APIs can be used to determine whether the data network connection is via Wi-Fi or mobile network (3G/GPRS/Edge/etc.). The client tracks this connectivity so that when a push notification arrives, the client can automatically choose to delay presenting the notification to the user if the client is likely to connect to Wi-Fi before a timeout period.

The prediction of Wi-Fi availability may be based on data collected on the device of historic behavior, with consideration of the days of the week and public holidays. This mechanism helps to reduce mobile network data consumption and the possible associated data charges. The higher bandwidth connection speeds of Wi-Fi also enable delivery of higher quality video than might be possible over mobile network data connections.
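A minimal sketch of such a prediction, assuming the client simply records whether it was on Wi-Fi for each (weekday, hour) slot and delays a notification when a slot within the timeout has historically been on Wi-Fi, follows. The class name, the 0.7 threshold, and the two-hour timeout are hypothetical, and public holidays are not modeled in this sketch.

```python
from collections import defaultdict
from datetime import datetime, timedelta


class WifiPredictor:
    def __init__(self):
        # (weekday, hour) -> [samples on Wi-Fi, total samples]
        self._history = defaultdict(lambda: [0, 0])

    def record(self, when, on_wifi):
        """Store one observation of the device's connectivity."""
        slot = self._history[(when.weekday(), when.hour)]
        slot[0] += int(on_wifi)
        slot[1] += 1

    def wifi_probability(self, when):
        """Fraction of past observations in this slot that were on Wi-Fi."""
        on_wifi, total = self._history.get((when.weekday(), when.hour), (0, 0))
        return on_wifi / total if total else 0.0

    def should_delay(self, now, timeout=timedelta(hours=2), threshold=0.7):
        """Delay the notification if Wi-Fi is likely within the timeout period."""
        hours = int(timeout.total_seconds() // 3600)
        return any(
            self.wifi_probability(now + timedelta(hours=h)) >= threshold
            for h in range(1, hours + 1)
        )
```

If the device has reliably been on Wi-Fi at 18:00 on a given weekday, a notification arriving at 17:10 on that weekday would be held back; on a weekday with no such history it would be presented immediately.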

Administration

API functionality may be exposed for administration tasks, such as blocking keywords or topic items.

API Design

A REST based API handles client interactions and, because of its client facing nature, is cached and distributed. Video Queues are delivered in a format based on the MediaRSS protocol in multiple output formats, e.g. RSS, ATOM, XML, JSON, etc. Although there are elements that require extension of the protocol, its use provides a base for the data format design.

Supplementary Requirements

Hand-picked ‘Featured’ items, analytics and any other web supporting services are left to the client implementation.

Example Clients

Web Client

FIG. 10 illustrates an example web client interface and its functional components. This interface demonstrates the use of a number of platform elements:

Trending Topics 1001:

As delivered from the Trending Topics engine. The small ‘pin’ icon beside the ‘Trending’ title text in this example is a location pin, suggesting that the results are local to the user, rather than being a ‘global’ trending topic. This is implemented using the location segmentation as described in the Trending Topics section of this document.

Custom Video Queue 1002:

The user's Custom Video Queue is displayed as a series of thumbnails at the bottom of the page. This is custom to the logged in user, so it would be replaced with a login button if no user is currently signed in.

Discover 1007 replaces the custom video queue with a series of featured items, drawing these results from the Custom Queue block 508 of the platform (optionally filtering results to match a user's location).

Shared 1003:

‘Shared 1003’ provides a queue format similar to Discover 1007, with the number of times an item has been shared (on social networks, Twitter/Facebook/etc.) as the result order.

Video Player 1004:

The active video item, as selected from the active custom queue, Discover 1007, or Shared 1003 tab.

Transcript 1005:

The transcript of the active video is displayed. This can be delivered from the data store based on the topic source/channel and time.

Search Box 1006

The Search Box 1006 is where a user would type in a request for a specific video.

IPTV Client

FIG. 11 illustrates an example IPTV client and its functional components.

The IPTV interface example offers a little less functionality than the web example, but displays an ‘App’ Selection 1101 at the bottom of the page, and above that, the custom queue of video items to select from.

The video thumbnails show the view after the user has connected to their social account to configure their interests, or, like the web based ‘Discover 1007’ list, featured/top ranked items for users who are not connected.

Mobile Client

FIG. 12 is an illustration of an example mobile client. Again, like the exemplary web app described above, a custom queue is displayed, along with the ‘Discover 1007’ and ‘Shared 1003’ buttons.

The ‘Connect’ button 1201 at the top of the example image enables the user to connect using Facebook's Open Graph API, as a source for taste graph interests.

A ‘Refine 1001’ button at the center of the bottom tab panel is provided, leading to a list of taste graph interests that the user can modify.

Text Data Processing

For the purposes of explaining the System 100 design, subtitles will be used as the example data source because they provide ‘chunks’ of text with associated timestamps. Data can also be supplied by a content provider's API or through voice recognition software, in which case the System 100 follows the same methodology except that it considers the complete body of text as a text ‘chunk’.

The timestamp for subtitle data is not the time that the Capture Platform 110 receives the text, but the timestamp that the source defines. For subtitles the Coordinated Universal Time is used to identify when the word or words appeared on screen.

In addition to the text data, subtitles also provide encoded data, by using color to represent different speakers and text markers for non-text actions such as [APPLAUSE], [LAUGHTER], [TRANSLATION], etc.

Subtitle text in particular is prone to errors in the initial translations, as are Optical Character Recognition (OCR) and voice recognition processing. If the content is delivered Over-the-air 203, original signal corruption can also cause errors in the initial translation.

These incoming text chunks can be visualized as a stream, as shown in FIG. 6.

Source

Context of the data source is also received by the System 100, including:

-   -   Source Location (US, UK, Global, etc.)
    -   Source Description, including, when available:
        -   Text Title and Description of channel or item
        -   Timing Information: Duration, Start Time, End Time (as is the case for television programs)
        -   Content Rating: i.e. Adult, Violence, etc., in the MediaRss “media:rating” specification format
        -   Source Category: News, Entertainment, Sports, etc.
    -   Context/Tags (API provided content)
    -   Language

Supporting Data Sources

To help in the process, additional information is supplied:

-   -   Dictionaries: to provide a lookup source for context information
    -   Nielsen Data: to provide additional program context, demographic, and rating information
    -   Online Resources: Available online resources are used to generate dictionary content (DMOZ, Y! BOSS, Amazon Alexa API)
    -   EPG Data & Feeds: Ad Break Timings, Extended (commercial) feed (e.g. Actors, Genre, etc.) are used where available for the source.

Text Input Architecture

Although each text chunk is associated with a number of parameters (e.g. source channel, location, etc.), the implementation does not require that this information be explicitly passed from the Capture Platform 110 to the search platform 120.

The System 100 can use source IDs and timing data combined to reduce the amount of data encoded directly alongside each text chunk that is passed to the search platform 120, and therefore reduce processing needs and resource consumption.
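To illustrate the idea, a text chunk can carry only a source ID and timing data, with the remaining context resolved from a registry on the receiving side. The registry contents, the `SourceContext` fields, and the chunk keys below are hypothetical and serve only as a sketch of this lookup.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SourceContext:
    channel: str
    location: str
    language: str
    category: str


# Hypothetical registry held by the search platform, keyed by source ID.
SOURCE_REGISTRY = {
    "src-42": SourceContext("Channel X", "UK", "en", "News"),
}


def expand_chunk(chunk, registry=SOURCE_REGISTRY):
    """Attach the source context to a compact chunk using its source ID."""
    context = registry[chunk["source_id"]]
    return {**chunk, **asdict(context)}


# The chunk itself carries only the ID, the source-defined time, and the text.
chunk = {"source_id": "src-42", "time": "1:01:33",
         "text": "We report from the Libyan capital."}
expanded = expand_chunk(chunk)
```

The compact chunk stays small on the wire, while the expanded form recovers the channel, location, language, and category when they are needed for processing.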

Outputs

This process outputs segmented topics that contain the complete text of the topic, a start and end time, and additional classification information. This additional classification data can include the primary topic and weighted categorization data such as location, people, organizations and keywords related to the topic.

Process

As illustrated and previously discussed with respect to FIG. 2, incoming text is processed in a series of queues, or buffers, for semantic analysis and processing. The sections that follow describe these processes with further reference to FIGS. 6, 7, and 8. In the context of FIG. 2, the steps described below for word tokenization (step 1), categorization (step 2), and sentence extraction (step 3) are generally performed in the Text Decoding Server 208, whereas the remaining steps are generally performed in the Topic Extraction & Segmentation component 212 in conjunction with its reference to the contextual database. The application of these methods to the system elements described here can be implemented according to system design techniques with knowledge of the relevant design needs in a particular implementation.

Step 1: Word Tokenization

Text ‘chunks’ are divided into individual words.

-   -   Tokenization: Latin alphabet derived languages may be tokenized by whitespace and punctuation (period/full stop) separation. Other languages vary. For example, in Modern Japanese and Chinese no white space separation exists, and Arabic words are often a conjunction of several sub-words.
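As a minimal sketch of whitespace-and-punctuation tokenization for Latin-alphabet text, the following Python fragment may help. The pattern and the function name `tokenize` are illustrative assumptions and are not taken from the described system; languages without whitespace separation would need a different approach.

```python
import re

# Assumed pattern: a word (optionally with internal apostrophes, e.g. "Gaddafi's")
# or a single punctuation character. Whitespace is discarded.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:'[A-Za-z]+)*|[^\sA-Za-z0-9]")


def tokenize(chunk: str) -> list[str]:
    """Split a text chunk into word tokens and punctuation tokens."""
    return TOKEN_PATTERN.findall(chunk)


tokens = tokenize("We report from the Libyan capital.")
```

Punctuation is kept as its own token here because later stages use periods as sentence-end indicators.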

Stemming:

Where applicable, stemming algorithms are applied to the source words. The basic function of a ‘stemming’ algorithm is to enable matching of words based on their root. For example, a stemming algorithm extracts the root word “connect” from the words “connected,” “connecting,” or “connection.” Numerous algorithms are available across a broad language base.

Phonetic stemming can also be applied to compensate for errors in the text, in particular when the text is derived from a voice recognition mechanism. Phonetic stemming compensates for language irregularities by providing word alternatives based on the source word's phonetic structure. For example, it will match “see” with “sea” and “through” with “threw.”
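The root-word stemming described above can be illustrated with a deliberately naive suffix stripper. This is a sketch only: the suffix list and the `min_root` threshold are assumptions chosen to reproduce the “connect” example, and a production system would more likely use an established stemming algorithm for each language. Phonetic stemming is not covered by this sketch.

```python
# Assumed, illustrative suffix list; real stemmers apply far richer rules.
SUFFIXES = ("ing", "ion", "ed", "s")


def naive_stem(word: str, min_root: int = 3) -> str:
    """Strip one known suffix, provided a root of at least min_root letters remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_root:
            return word[: -len(suffix)]
    return word


stems = {naive_stem(w) for w in ("connected", "connecting", "connection")}
```

All three inflected forms reduce to the same root, which is what allows them to be matched against one another in the index.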

‘Stop Words’:

These are words which should not be indexed, such as in English: for, of, at, a, and, the, etc. These words are not removed from the buffer, as they form part of the semantic analysis for categorization later in the process, but they are not considered when determining a sentence's context. Instead, these words are flagged as stop words and are ignored when looking for context.

‘Parts of Speech’ Tagging:

The incoming text is processed and words are marked with their assumed ‘part of speech’ such as noun, verb, adjective, etc. Detected nouns are of particular interest to the System 100 later in processing, as potential sources of context information such as locations, people, and organizations. These words are usually the best indicators of a text stream's context.

TABLE 1 Text Buffer After Tokenization

TIME      WORD        STEMMED
. . .
1:01:24   Gaddafi     gaddafi
1:01:24   .           -
1:01:33   We          -
1:01:33   report      report
1:01:33   from        -
1:01:33   The         -
1:01:33   Libyan      libya
1:01:33   capital     capital
1:01:33   .           -
1:01:37   Muammar     muamar, muammar, muamer, meuamer
1:01:37   Gaddafi's   gaddafi
1:01:37   tanks       tank
1:01:37   Are         -
1:01:37   ready       -
1:01:37   outside     outside
1:01:37   The         -
1:01:37   western     west
1:01:37   town        town
1:01:37   Of          -
1:01:37   Zawiya      zawiya
1:01:37   .           -
1:01:41   Preparing   prepar
. . .

Step 2: Categorization

Dictionary lookups are performed to create an extended graph of associated topics with source tokens:

Morphologic Dictionary Lookup:

The basic description of this process is that words are matched against a database of morphologically related words. For example, “airplane” is related to “aircraft,” “fly” (“flying”), “passenger,” “pilot,” “airport,” etc.

Assumed Relationships Dictionary Lookup:

In addition to the more explicitly defined relationships in the Morphologic dictionaries, a database of more loosely coupled relationships is maintained and updated as text is received. This dictionary provides a further source of related data for each of the word tokens.

This dictionary is generated by storing relationships between nouns identified in the parts of speech tagging process. For example, the English sentence, “In Brazil, the famous carnival has begun in Rio,” leads to associations where “Brazil” is related to “Carnival,” “Rio”; “Carnival” is related to “Brazil,” “Rio”; and “Rio” is related to “Brazil,” “Carnival”. In this way, important information containing multiple words is not discarded or sectioned into unrelated topics.

This helps the System 100 to recognize that the context of a topic has not changed between one sentence discussing “Rio” and the next that mentions “Brazil”.
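The assumed-relationships dictionary can be pictured as a co-occurrence map between the nouns of each sentence. The sketch below assumes the nouns have already been identified by parts-of-speech tagging; the function name `update_relationships` and the use of simple co-occurrence counts are illustrative choices, not details given in this description.

```python
from collections import defaultdict
from itertools import permutations


def update_relationships(relations, nouns):
    """Record that every noun in a sentence is related to every other noun in it."""
    unique_nouns = list(dict.fromkeys(nouns))  # drop repeats, keep order
    for a, b in permutations(unique_nouns, 2):
        relations[a][b] += 1  # count of sentences in which a and b co-occur


# relations[noun][related_noun] -> co-occurrence count
relations = defaultdict(lambda: defaultdict(int))

# Nouns from: "In Brazil, the famous carnival has begun in Rio"
update_relationships(relations, ["Brazil", "Carnival", "Rio"])
```

After this update, a later sentence mentioning only “Rio” can be linked back to “Brazil” through `relations["Rio"]`.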

Categorization Dictionary Lookup:

This lookup attempts to assign categories from predefined database lists. This data is used:

-   -   To apply context: such as associating “goal” and “football” with “sport,” “rain” with “weather,” “Elvis” with “music,” etc.
    -   To assign categorization: for example, “Lady Gaga” is a “Person” in “Music”; “Apple” is an “organization”; “Libya” is a “location”.
    -   To augment data:

For example, in location based lookups, “London” can be expanded to“Location,” “London, England, United Kingdom, Europe”.

Rather than pure nouns, keywords such as “scored” can also be used.

Other lists include:

-   -   Places (as described)
    -   People (Celebrities, Authors, Actors, Musicians, etc.)
    -   Organizations (Microsoft, IBM, Verizon, etc.)
    -   Events (Christmas, Summer, War, Famine, Drought, Concert)
    -   Media (books, songs, films, games, etc.)
    -   Objects (animal, vegetable, mineral; although crossover exists with the morphological database)

Dictionaries are assumed not to be complete and only serve to help the categorization process. These dictionaries are generated by analyzing taste graphs (as described later) for popular terms and by gathering information from available sources. Online resources and search tools such as DMOZ, Yahoo! BOSS, Amazon Alexa API, and others are used as data sources for dictionary generation.

The functionality of these dictionary lookups includes:

Weight/Confidence:

As we categorize tokens with associated topics, we assign weight or confidence values that define how important we believe a word to be.

The base of this weight or score is the count of historic mentions of an expression on a channel. For example, if CNN mentions Obama a thousand times a day, and President Karzai 30 times, then given the sentence “Obama met with President Karzai today,” we would assume President Karzai, as the less frequently mentioned and therefore more distinctive term, to be the dominant topic of the sentence and, as the sentences are combined, of the discussion as a whole.
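One way to picture this frequency-based weighting is an inverse-frequency score, sketched below. The text states only that the historic mention count forms the base of the weight; the specific formula (the reciprocal of the count, normalized to sum to one) and the function name `rarity_weights` are assumptions made for illustration.

```python
def rarity_weights(terms, daily_mentions):
    """Weight each term by the inverse of its historic mention count (assumed formula)."""
    raw = {term: 1.0 / (1 + daily_mentions.get(term, 0)) for term in terms}
    total = sum(raw.values())
    return {term: weight / total for term, weight in raw.items()}


# Hypothetical per-channel history matching the example in the text.
history = {"Obama": 1000, "President Karzai": 30}
weights = rarity_weights(["Obama", "President Karzai"], history)
dominant = max(weights, key=weights.get)
```

With these counts the rarely mentioned term receives most of the weight, so it is treated as the dominant topic of the sentence.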

The weighting algorithms are also driven by dictionary source. For example, a location dictionary result is weighted higher than a result from context relationships dictionaries, such that we can say the sentence “In London David Cameron met political leaders for talks” is strongly weighted/almost certainly about London and the UK and probably related to David Cameron and, by association, politics. In addition to this dynamic collection of previous mentions, we also draw on the related metadata.

Stream Context:

As each sentence is processed (sequentially as they arrive), these confidence values are aggregated to provide a dynamic set of keywords and related contexts that define what we believe the most important keywords are and the current context of the conversation. In addition to the identified entities, we also gather context from the following sources:

Source Program Data:

Where available, inbound data may also be tagged with context such as “News,” “Sport,” “Entertainment,” etc. Such is the case for much of the API provided content. For broadcast content, over-the-air 203 or commercially supplied EPG data can be used to provide categorization and context information.

External Sources:

In particular, broadcast content uses Nielsen data as an additional source for categorization. Commercial EPG data feeds are also available; some even define advertising break schedules.

Recent Stream Topics:

For example, a text stream with mentions of “goal” and “football” would lead to a strong weighting towards a context of “sport”. As additional tokens are processed, a dictionary lookup for “Liverpool” is more weighted towards Liverpool the football club, as opposed to Liverpool the city.
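The effect of recent stream topics on an ambiguous lookup can be sketched as a context-boosted choice between candidate dictionary results. The candidate tuples, base weights, and the multiplicative boost below are hypothetical values chosen to illustrate the “Liverpool” example; they are not taken from the described system.

```python
def disambiguate(candidates, stream_context):
    """Pick the candidate whose category best fits the current stream context.

    candidates: list of (label, category, base_weight) tuples.
    stream_context: mapping of category -> confidence accumulated from recent tokens.
    """
    def score(candidate):
        _label, category, base_weight = candidate
        # Assumed rule: boost the base weight by the context confidence.
        return base_weight * (1.0 + stream_context.get(category, 0.0))

    return max(candidates, key=score)[0]


liverpool = [
    ("Liverpool (city)", "location", 0.5),
    ("Liverpool (football club)", "sport", 0.4),
]
in_sport_stream = disambiguate(liverpool, {"sport": 0.9})
without_context = disambiguate(liverpool, {})
```

With a strong “sport” context the football club wins; with no context the lookup falls back to the higher base weight.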

Multiple Word Nouns:

Each of the dictionary lookups described includes support for multiple word associations. For example, if the word “European” in the token buffer is followed by “Union,” a stronger weight is applied to the complete result, as opposed to either individual word.

Unlike a filtering process, weighting is applied so that a result of “Middle East Politics” does not nullify results for “Middle East”. When a query is made against the index for “Middle East Politics,” the higher weight ensures a higher ranking for the more complete match in the results.
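A sketch of multiple-word matching by weighting rather than filtering follows. The dictionary contents, the base weights, and the rule of scaling a phrase's weight by its word count are assumptions for illustration; the point is only that shorter matches are retained while the most complete match scores highest.

```python
def phrase_matches(tokens, dictionary):
    """Return every dictionary phrase found in the tokens, weighted by phrase length."""
    lowered = [token.lower() for token in tokens]
    max_words = max(len(phrase.split()) for phrase in dictionary)
    results = {}
    for start in range(len(lowered)):
        for length in range(1, max_words + 1):
            if start + length > len(lowered):
                break
            phrase = " ".join(lowered[start:start + length])
            if phrase in dictionary:
                # Assumed rule: longer (more complete) phrases receive more weight.
                results[phrase] = dictionary[phrase] * length
    return results


# Hypothetical dictionary entries with base weights.
dictionary = {"middle east": 0.6, "middle east politics": 0.7, "politics": 0.3}
matches = phrase_matches(["Middle", "East", "Politics"], dictionary)
best = max(matches, key=matches.get)
```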

Definite Articles and Capitalization:

According to individual language rules, the dictionary is structured to support passing preceding word types, to assist in distinguishing between “Beatles” and “The Beatles”. Although correct capitalization cannot be assumed, capitalized nouns also play a role in result weight. For example, organizations and places are capitalized: “Apple” versus “apple,” and “Middle East” versus “middle east”.

TABLE 2 Table Representation of Buffer Data After Lookup

COUNT  TERM      STEM               MORPHOLOGIC         ASSOCIATION            CATEGORY
1      Muammar                                          {Libya, 0.823}         {People: 0.821}
       Gaddafi
2      Gaddafi   gaddafi                                                       {People: 0.721}
1      Muammar   muamar, muammar,                                              {People: 0.651}
                 muamer, meuamer
1      Tanks     tank               {soldier, 0.631},   {war, 0.732},
                                    {cannon, 0.331},    {Libya, 0.234},
                                    {enemy, 0.831},     {Afghanistan, 0.436}
                                    {military, 0.731},
                                    {fire, 0.331}
1      outside   outside
1      west
1      Town      town               city
1      Zawiya    zawiya                                 {Libya, 0.913},        {Location: 0.921,
                                                        {Muammar, 0.323},      [“Libya,” “Maghreb,”
                                                        {Gaddafi, 0.223}       “North Africa,” “Africa”]}

Step 3: Sentence Extraction

The next stage of processing is to divide the token stream into sentence blocks. This is performed by using a separate segmentation process. The function of this segmentation process is to provide the best match for sentence or topic start/end times.

Similar to the token stream, at the core of the segmentation process is a buffer of time values and confidence values. These confidence values, or weights, are further illustrated in FIG. 7.

These confidence weights are generated using a number of indicators:

Syntactic Indicators 801:

Many languages define a period (e.g. Modern Latin derivatives, English, Japanese, Chinese, etc.) to denote the end of a sentence structure. In English, capitals commonly indicate the beginning of a sentence. Care should be taken towards exceptions such as periods after abbreviations.

Timing Indicators 802:

In the case of subtitle information, pauses in the source text timestamps indicate pauses in original spoken words. This information is vital to determining the source chunk length of a subtitle source.

TABLE 3 Example Timing Data

TIME      TOKEN
. . .
0:00:00   And
0:00:02   In
0:00:02   London
0:00:02   today
0:00:02   ,
0:00:02   There
0:00:02   Were
0:00:06   Riots
0:00:06   .
0:00:09   Protestors
0:00:09   marched
. . .

The System 100 recognizes that the 3 second gap between “riots” and “protestors” is more significant than the 4 second gap between “were” and “riots,” simply because the combined chunk, “in London today, there were,” would have taken longer to say.
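The reasoning above can be sketched by subtracting an estimated speaking time from each raw gap. The speaking rate of 0.4 seconds per word and the function name `pause_significance` are assumptions for illustration; the description does not specify how the adjustment is computed.

```python
def pause_significance(gap_seconds, words_in_chunk, seconds_per_word=0.4):
    """Estimate the silent part of a gap by removing the time needed to speak the chunk.

    seconds_per_word is an assumed average speaking rate.
    """
    return gap_seconds - words_in_chunk * seconds_per_word


# "In London today, there were" (5 words) is followed 4 seconds later by "Riots".
after_long_chunk = pause_significance(gap_seconds=4, words_in_chunk=5)

# "Riots." (1 word) is followed 3 seconds later by "Protestors".
after_short_chunk = pause_significance(gap_seconds=3, words_in_chunk=1)
```

Under this assumption the shorter raw gap leaves more unexplained silence, so it is the stronger break indicator.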

Source Timing Patterns 803:

Many television sources can be assumed to follow a programming pattern. Content is traditionally delivered in 15, 30 or 60 minute blocks, so that the 0, 15, 30 and 45 minute markers of every hour are weighted as possible indicators of context change.

Source Sentence Length 804:

Historic data and language are analyzed to provide an expected phrase length generated for each source. This is a weighted value, stacked to prevent obvious failures like 60 word sentences, and does not assume speakers will follow a strict speech pattern.

Text Encoded Data 806:

In addition to the actual text, subtitles contain additional information, such as:

-   -   Color: generally used to distinguish between speakers, indicating a change in sentence.
    -   Explicit speaker definition, followed by a colon; e.g. “Charlie: . . . ”
    -   Non-text activity (often capitalized and/or in square brackets), such as “[LAUGHTER],” “[MUSIC],” “[APPLAUSE]”.
    -   Changes in speaker or non-verbal activity are used as indicators for natural sentence breaks.
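The subtitle cues listed above can be detected with a few simple checks, sketched below. The regular expressions, the hint labels, and the function name `break_hints` are illustrative assumptions; real subtitle feeds vary in how they encode color and speaker information.

```python
import re

# Assumed conventions: a capitalized name followed by a colon, and
# upper-case non-text activity enclosed in square brackets.
SPEAKER_LABEL = re.compile(r"^([A-Z][\w .'-]*):\s+")
NON_TEXT = re.compile(r"\[([A-Z ]+)\]")


def break_hints(line, color, previous_color):
    """Return the sentence-break indicators found on one subtitle line."""
    hints = []
    if previous_color is not None and color != previous_color:
        hints.append("color change")
    if SPEAKER_LABEL.match(line):
        hints.append("speaker label")
    if NON_TEXT.search(line):
        hints.append("non-text activity")
    return hints


hints = break_hints("Charlie: Hello there", color="yellow", previous_color="blue")
```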

Source Defined Data 805:

EPG data provides explicit program start and end times to use for start and end time definition.

Note: some programs overrun and pre-roll credits.

Some data, such as that provided by Content API 408 Providers such as Reuters and Associated Press, is delivered as a segmented block with a single ‘published’ time, which acts as the start, end and content time. In these instances, sentence extraction is not required; the article acts as a single, already defined topic.

Extended EPG Data 807:

Advertising break timings are used when available.

Expression Dictionary 808:

Expressions are defined, on a per channel basis, to mark start and end expressions. Expressions beginning with “Now on . . . ,” “Hello . . . ,” “Next . . . ,” “Welcome to . . . ,” “The Headlines,” etc. are a good indicator of a segment beginning, while expressions such as “ . . . back later,” “ . . . Goodnight,” etc. are a good indicator that a segment has come to a close.
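A minimal sketch of an expression dictionary lookup is given below. The expression lists are taken from the examples in the text, but the function name `segment_marker` and the simple prefix/suffix matching are assumptions; a per-channel implementation would hold one such pair of lists for each channel.

```python
# Example expressions from the text; a real dictionary would be per channel.
START_EXPRESSIONS = ("now on", "hello", "next", "welcome to", "the headlines")
END_EXPRESSIONS = ("back later", "goodnight")


def segment_marker(sentence):
    """Return 'start', 'end', or None depending on how the sentence opens or closes."""
    text = sentence.lower().strip(" .!?:")
    if text.startswith(START_EXPRESSIONS):
        return "start"
    if text.endswith(END_EXPRESSIONS):
        return "end"
    return None


marker = segment_marker("Welcome to the evening news.")
```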

Audio/Video Analysis 809:

Algorithms to process video scene change are available as part of key frame generation routines in many video encoding implementations. Audio analysis is cheaper in terms of processing power, and a sudden change in volume can be used to indicate a change in context. Sudden audio adjustment is a particularly common feature of broadcast advertising and can be used to guess when a television program has ended and commercials have begun.

TABLE 4 Simplified View of Break Indicators

TIME      WORD        COLOR    BREAK INDICATORS
. . .
1:01:24   Gaddafi     blue     Capitalized
1:01:24   .           blue     End Marker
1:01:33   We          yellow   Follows Color Change, Follows Period, Capitalized, Follows Pause [9 secs, 1 word]
1:01:33   Report      yellow
1:01:33   From        yellow
1:01:33   The         yellow
1:01:33   Libyan      yellow   Capitalized
1:01:33   Capital     yellow
1:01:33   .           yellow   End Marker
1:01:37   Muammar     yellow   Follows End, Follows Pause, Capitalized, Follows Pause [4 seconds, 6 words]
1:01:37   Gaddafi's   yellow   Capitalized
1:01:37   Tanks       yellow
. . .

Using this data buffer, the System 100 can divide the incoming text into sentences. An example is given in the table below:

TABLE 5 Example Result of Sentence Extraction

TIME      SENTENCE
. . .
1:01:24   Residents say it is now quiet after an intense day of fighting with forces loyal to Colonel Gaddafi.
1:01:33   We report from the Libyan capital.
1:01:37   Muammar Gaddafi's tanks are ready outside the western town of Zawiya.
1:01:37   Preparing to take back with brutal force territory lost in recent days to the regimes opponents.
. . .

Step 4: Grammar Checking

Sentences are examined and a score generated according to grammatical correctness, based on:

Density of Unexpected Characters:

For example: “Now we w&*AS”!## so double the killer delete select all.” Sentences that contain unrecognized tokens with non-alphanumeric characters are flagged with a below average grammar confidence score, which lowers the weight value of the related contexts and topic in the index so that this entry is less likely to appear as a result.
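The density check can be sketched as the fraction of tokens that still contain non-alphanumeric characters after ordinary punctuation is stripped. The function name `unexpected_density` and the exact set of tolerated characters are assumptions; the description does not define how the density is measured.

```python
def unexpected_density(sentence):
    """Fraction of whitespace-separated tokens containing unexpected characters."""
    tokens = sentence.split()
    if not tokens:
        return 0.0

    def is_unexpected(token):
        # Ordinary leading/trailing punctuation is tolerated (assumed set).
        core = token.strip(".,;:!?\"'()")
        # Apostrophes and hyphens inside a word are also tolerated.
        return bool(core) and not core.replace("'", "").replace("-", "").isalnum()

    return sum(is_unexpected(token) for token in tokens) / len(tokens)


garbled = unexpected_density("Now we w&*AS!## so double")
clean = unexpected_density("We report from the Libyan capital.")
```

A higher density would translate into a lower grammar confidence score for the sentence.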

Sentence Length:

Lost data may make end of sentence detection difficult: e.g. missing period characters within the sentence structure. By examining historic channel data, we define average and expected maximum sentence lengths for channel content. As this expected length is reached and exceeded, we increasingly aggressively identify potential sentence break points, using timing data (a pause between expressions), capitalized words and parts of speech tagging (e.g. definite article ‘The’ followed by noun) to detect a potential sentence break.

Formal Grammar Rules:

This step is performed using available libraries, although the effect of the grammar score on the overall topic ranking is related to the confidence in these algorithms. A channel known to adhere less to conventional grammar rules will weigh this category less than a channel known to adhere more frequently to conventional grammar rules. For example, a CNN evening newscast will weight this category as more important than, say, MTV.

It is expected that many ‘valid’ spoken text streams will contain colloquialisms and grammatical irregularities, so overall these checks should not greatly impact the topic relevance weight.

Step 5: Sentiment Analysis

Sentences are examined to produce positive or negative sentiment values. Dictionaries are provided to weigh individual words with individual values, with care taken towards modifiers like “not good,” “super bad,” “very happy” and broader analysis of overall semantic context, such as “I cannot say I am happy.”
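A dictionary-based sentiment score with simple modifier handling might look like the sketch below. The word scores, the intensifier multipliers, and the four-word negation window are all hypothetical values chosen to reproduce the examples in the text; they are not part of the described dictionaries.

```python
# Hypothetical dictionaries; real ones would be far larger and context-balanced.
WORD_SCORES = {"good": 1.0, "happy": 1.0, "bad": -1.0}
INTENSIFIERS = {"very": 1.5, "super": 2.0}
NEGATORS = {"not", "cannot", "never"}


def sentiment(sentence, negation_window=4):
    """Sum word scores, scaling by a preceding intensifier and flipping on negation."""
    words = [word.strip(".,!?\"'") for word in sentence.lower().split()]
    score = 0.0
    for i, word in enumerate(words):
        if word not in WORD_SCORES:
            continue
        value = WORD_SCORES[word]
        if i > 0 and words[i - 1] in INTENSIFIERS:
            value *= INTENSIFIERS[words[i - 1]]
        # Assumed rule: a negator within the preceding window flips the sign.
        if NEGATORS & set(words[max(0, i - negation_window):i]):
            value = -value
        score += value
    return score


indirect_negative = sentiment("I cannot say I am happy.")
```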

The dictionaries that serve the source weight are balanced by context, such that, for example, superlative heavy sports content does not overly dominate the resultant weight (win, score, fantastic goal).

When a topic has been extracted, this sentiment information is aggregated and stored with the topic as the overall sentiment.

Step 6: Topic Extraction & Segmentation Component 212

FIG. 8 represents the processes by which a clip's Start and End timedata is generated.

At this point in processing, the buffer can be said to consist of a series of sentences, each with weighted primary and related keywords and context associated with it.

When a new sentence arrives, it is assigned a unique Topic ID. The sentence is compared to recent/historic sentences and related keywords. The System 100 then determines if the sentence is a match for recent or historic topics. If a sentence is a strong match, it is assigned to a recent or historic sentence Topic ID. If a match cannot be made, the sentence retains its unique Topic ID.
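The Topic ID assignment described above can be sketched as a keyword-overlap comparison against recent topics. The use of Jaccard similarity, the 0.3 threshold, and the function name `assign_topic` are assumptions for illustration; the description does not state how a “strong match” is measured.

```python
def assign_topic(sentence_keywords, recent_topics, new_topic_id, threshold=0.3):
    """Return the ID of the best-matching recent topic, or keep the new unique ID.

    recent_topics maps topic ID -> set of keywords seen so far for that topic.
    """
    best_id, best_score = new_topic_id, 0.0
    for topic_id, topic_keywords in recent_topics.items():
        union = sentence_keywords | topic_keywords
        # Jaccard similarity (assumed measure of a "strong match").
        score = len(sentence_keywords & topic_keywords) / len(union) if union else 0.0
        if score > best_score:
            best_id, best_score = topic_id, score
    if best_score >= threshold:
        recent_topics[best_id] |= sentence_keywords  # strong match: join that topic
        return best_id
    recent_topics[new_topic_id] = set(sentence_keywords)  # no match: retain unique ID
    return new_topic_id


recent = {1: {"libya", "gaddafi", "tanks"}}
same_topic = assign_topic({"gaddafi", "zawiya", "tanks"}, recent, new_topic_id=2)
new_topic = assign_topic({"football", "goal"}, recent, new_topic_id=3)
```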

While new sentences are being assigned topics on a sentence level, the buffer contains a series of sentences and Topic IDs that are “waiting” for a change in context. The end Topic ID of the buffer is generated algorithmically, using a number of parameters:

Source Defined Timeout:

Depending on the data source and time, topics can be expected to be a certain duration. For example, a news channel might be expected in general to create several 30 second topics at the start of every hour, followed by 10 minute topics for the remainder of that hour. An entertainment program might be expected to produce four 15 minute topic blocks, such as interviews, over the course of an hour.

Timeout:

After a period of time passes during which no mention of the original topic or related keywords is made, the System 100 assumes that the discussion of that topic has ended.

Topic Category Confidence:

If a topic has a strong confidence in relationship to a particular category and new sentences arrive strongly connected to an unrelated category, this is an indication that the topic has changed to an unrelated subject.

If a series of sentences mention “sport,” “football,” “goal,” “Liverpool,” the System 100 has a confidence in ‘Sport’ as the current context with ‘Liverpool, UK’ as a location. Should the sentence that follows include “Libya,” “government” and “protests,” a change in context to news/politics is assumed.

Expression Dictionary 808 (Keywords & EPG Data 807):

Sentence extraction will supply heavily weighted ‘hard stop’ actions. In the news, a sentence beginning with “The Headlines:” is a good indication that the previous topic has ended. Similarly, EPG program start and end timing indicates a topic change. FIG. 8 represents the processes by which a clip's Start and End time data is generated.

Once a topic end has been identified, this completes the process of Topic Extraction & Segmentation 212. The category and related keyword confidence of the topic as a whole is analyzed and combined alongside the aggregated text and start/end times.

Video files are not necessarily physically broken into topic segments in this process. Along with the source channel, the start and end time is sufficient for the Media Storage and Delivery Platform 130 to retrieve and play requested topics. This enables the System 100 to reprocess and redefine topic start and end times without the overhead of cutting and/or merging video files.

Step 7: Complete, Segmented Topics

The completed topic is made up of the following data:

-   -   Source: location and channel. Can be defined as a database ID.
    -   Program Data: title, description, program (adult) rating, category.
    -   Start/End Times: the start and end time for the video. In combination with the source (location and channel), this is used to form the content request to the Media Storage and Delivery Platform 130.
    -   Text: the content of the topic.
    -   Category Information From Dictionary Lookups: people, location, organization, event, media (music/book/film/game/etc.).
    -   Extended Meta Data: sentiment, Nielsen ratings.

This data is passed on to a number of processes, including the index for storage, the trending engine and the real-time updates 503 component.

Trending Topics

As completed topics are detected, their related keywords are pushed to the ‘trending’ topics engine, which maintains sets of trending topics. The concept of trending topics or trending stories is one that is widely used, with many active examples online (Breaking News, Twitter, etc.).

The trending engine is a list of query values:

-   -   Time: currently trending, last hour, today, last week, last month, etc.
    -   Location: UK, USA, Europe, etc.
    -   Category: Music, Film, News, Entertainment, etc.
    -   Source: Channel or API that provided the trending topic

Each topic generated carries with it not just the active keyword text, but also a confidence score and the accompanying metadata, all of which is aggregated within this engine.

This enables the System 100 to generate rich trending topics lists as sets of query objects, for example in the pseudo JSON format below:

[
  { "text": "Muammar Gaddafi", "category": "People" },
  { "text": "Apple", "category": "Organisation" },
  { "text": "Libya", "category": "News", "location": "Libya" },
  { "text": "Volcano", "location": "Hawaii" }
]

By using this format, when trending keywords such as ‘Apple’ are queried, only the results relevant to the organization are returned.
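How a query object narrows results can be sketched as a field-by-field match against indexed topics. The sample index entries, the `id` field, and the function name `run_query` are hypothetical; an actual deployment would delegate this matching to the search index.

```python
def run_query(query, indexed_topics):
    """Return the IDs of topics whose fields match every field in the query object."""
    def matches(topic):
        return all(topic.get(field) == value for field, value in query.items())

    return [topic["id"] for topic in indexed_topics if matches(topic)]


# Hypothetical index entries for illustration only.
indexed = [
    {"id": 1, "text": "Apple", "category": "Organisation"},
    {"id": 2, "text": "Apple", "category": "Food"},
    {"id": 3, "text": "Libya", "category": "News", "location": "Libya"},
]
organisation_only = run_query({"text": "Apple", "category": "Organisation"}, indexed)
```

A text-only query for “Apple” would return both entries 1 and 2; adding the category keeps only the organization.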

Filters

Filter lists are applied to trending topics, formed from:

-   -   Stop Words:

Words that are not relevant to trends:

-   -   Non-Important: The stemming routine flags words such as the English words ‘the,’ ‘and,’ and ‘but’ as non-important.
    -   Context keywords: words such as ‘sport,’ ‘business,’ or ‘weather’ should not become trending keywords.
    -   Channel/Source Based words: On CNN, the station may frequently refer to the channel name or anchor names; these words should not become trending results.
    -   Days of the week: these begin to appear as ‘breaking’ if they are not properly managed.

Frequency analysis of existing data works to prevent these words from becoming trending topics.

Admin Override

Allows the Front-End Client 140 Admin Users to temporarily or permanently ban words from the trending topics lists in case a word slips past the frequency analysis algorithm. In this way, the System 100 eliminates words that users would otherwise not be interested in querying.

Ranking

Ranking determines the likelihood of a given topic appearing as the result of a user queried search, or of a user seeing a topic in a client's user recommendations. Ranking of a given topic is based on:

Frequency Analysis Algorithms:

Frequency analysis algorithms are used to determine trending topics.

Trends:

Trends are computed across all sources. Those with broad coverage across multiple sources are ranked higher than those trends discovered on a single source or a limited number of sources.

External Sources:

Where Terms of Service allow, external sources are mined and rankings are boosted where matches are found (for example, comparing television trend stories with those on internet news sites).

User Behavior Feedback:

Direct searches and specific user topic selection (clicking on a Topic), combined with algorithm balancing, serve as valuable indicators that a topic is popular. These algorithms may be carefully balanced to ensure that prominently placed keywords do not become self-fulfilling.

Directed Search

Directed search is the process of providing search results for user queries. These queries can come from text box entries or from trending topics requests (as opposed to the taste graph search for multiple entries). These queries can be defined as:

Query Types:

User Input Based:

Which simply means that a user has entered a query for a specific topic, such as Lady Gaga. Processing can include query tokens such as AND, OR, (“), etc.

Context-Based:

Searches that contain context information or filters. For example, “location: London” returns only results from London, and in the case of a combination: “riots location:Libya” will return only results related to “riots” in Libya.
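Separating free text from context filters in such a query can be sketched with a small parser. The `field:value` pattern and the function name `parse_query` are assumptions based on the two examples above; the full query syntax is not specified in this description.

```python
import re

# Assumed filter syntax: field:value or field: "quoted value".
FILTER = re.compile(r'(\w+):\s*("[^"]+"|\S+)')


def parse_query(query):
    """Split a query into its free-text part and a dict of context filters."""
    filters = {}

    def take(match):
        filters[match.group(1).lower()] = match.group(2).strip('"')
        return " "  # remove the filter from the free text

    free_text = " ".join(FILTER.sub(take, query).split())
    return free_text, filters


text, filters = parse_query("riots location:Libya")
```

The free text and the filters can then be passed to the index as separate query clauses.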

One index implementation uses “ElasticSearch,” a scalable derivative of Apache Lucene (or its web-enabled sister product Solr). This provides scalability and query processing, but another implementation could be chosen to provide equivalent functionality.

Outputs

The result of a search query is a response document containing an array of results. This result format is based on MediaRSS structure and is provided in multiple formats so that the client device can choose the best suited structure (for example, JSON).

Below is a simplified result item sample:

item: {
  'id': '12345-12345-12345-12345-12345',  // System UID
  'timestamp':                            // Last Update Time
  'title': 'Item Title',
  'description': 'Item Description',
  'country': 'GB',                        // ISO 3166-1 alpha-2 Country Code
  'mediaGroup': {
    'item': {
      'type': 'video',                    // 'video', 'stream', 'article', etc.
      // Array of rating codes, custom or variation on e.g.:
      // ICRA Ratings: http://256.com/gray/docs/pics/icra.html
      // MPAA Ratings: G, PG, PG-13, R, NC-17, Not Rated
      // ESRB Ratings: EC, E, E10+, T, M, AO, RP
      'ratings': ['NC-17', 'M'],
      'categories': ['Obama', 'America', ...],  // Equivalent to Topics
      'utc_start': '2000-01-01 00:00:00', // Start Time, fixed to UTC
      'utc_end': '2000-01-01 00:00:00'    // End Time, fixed to UTC
    },
    'thumbnail': { }
  }
};

Taste Graph Management

This component receives a user's ‘taste graph’ data and stores it for comparison against video topics. This comparison matches the user with topics: first with topics from the existing topic index, and second with the real-time stream of newly created topics as they are received. The user's taste graph combines profile information such as gender, age and location, with other user information such as the user's interests.

Just as categorization is applied to the incoming data for Topic Extraction 212 and video segmentation, the user's interests are also processed to provide equivalent ontological data for search matching.

Input

Users are offered the option to connect the application to external networks to import existing taste graph information. Depending on the client program, a user can input profile information, as well as their specific interests. Facebook's Open Graph (formerly Facebook Connect) is the most prevalent example of this, but other available graphs, such as Twitter, LinkedIn, and others, can be used to gather user information.

FIG. 9 displays an example of information that can be extracted from an open graph.

Process

The System 100 considers the following specific data categories for processing:

Unique Data 901:

Unique information, such as User Id values, is valuable as part of a user's profile, but ignored in the process of topic relevance. Email addresses, however, are considered a factor when determining a user's interest, simply because their sources can be used to determine what kinds of data a user may be looking for. For example, a .edu address is less likely to be looking for current culture and more likely to be looking for factual information.

Discrete Data 902:

Discrete data can be organized into individual sets, such as Age, Location, Gender, and Religion, to form an applicable basis for topic matching. Discrete data requires less preprocessing than abstract data does.

Abstract Data 903:

User profiles may contain interests in the form of text keywords such as band names, books, sports, and film titles. Depending on the data source, this information can be supplied with context. Interests such as Books, Television, and Film may be separated into specific categories or they may be grouped together.

The System 100 automatically attempts to categorize this abstract data so that user profiles better match an index.

API Specific Data 904:

API specific data is data specific to the source of the social graph, for example Facebook User Name, Twitter User Name, etc. This information is not highly valuable on its own, but it opens a path to yet more information.

Preprocessing

Tokenization for interest graphs does not occur in the same way as it does for inbound text. An interest like “Kings of Leon” is taken as a complete expression rather than being divided into individual words. In languages where appropriate, definite articles can be stripped for the purposes of the dictionary lookup, although their presence is still noted. For example, an interest in “The Beatles” is queried against the index as “Beatles” but an additional parameter notes the inclusion of the definite article “the” so that a result for the band, rather than the insect, is identified.
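The preprocessing of an interest expression can be sketched as follows. The returned field names (`term`, `definite_article`) and the function name `normalize_interest` are hypothetical; the description says only that the article is stripped for the lookup and its presence noted in an additional parameter.

```python
def normalize_interest(interest, articles=("the",)):
    """Keep the interest as one expression, stripping a leading definite article.

    The article list is language-specific; only English "the" is assumed here.
    """
    words = interest.split()
    if len(words) > 1 and words[0].lower() in articles:
        return {"term": " ".join(words[1:]), "definite_article": True}
    return {"term": " ".join(words), "definite_article": False}


beatles = normalize_interest("The Beatles")
kings = normalize_interest("Kings of Leon")
```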

Dictionary Lookups

The dictionaries used are the same as those used to categorize inbound text from the Capture Platform 110. Sets of common terms from taste graph profiles form a valuable data source in determining popular, out-of-dictionary expressions. The System 100 defines the dictionary as providing the following:

-   -   Context: “Goal” and “football” are associated with “sport,” “rain” with “weather,” “Elvis” with “music,” etc.
    -   Categorization: for example, “Lady Gaga” is a “Person” in “Music”; “Apple” is an organization; “Libya” is a “location”.
    -   Additional metadata & normalization: for example, in a location based lookup, “London” can be expanded to “Location: London, England, United Kingdom, Europe”.

Once these lookups have completed, we are left with basic profile information alongside an expanded graph of user information.

Demographic & Weighting

As with the inbound data, categorization can yield multiple results for individual keywords. Confidence values are applied to each. Because of the scale of data available, processing is assisted by determining a broad demographic for the user. If the System 100 is given the basic information that a user is a 24 year old US male with an interest in sports, the System 100 can quickly provide a set of related topics.

Inverted Index

In order to expedite lookup of users from inbound topics (for real-time push), profile data is stored in an inverted index, with topics forming the key to weighted user lists. Instead of serving as the primary data store, this data source is simply used to provide an optimized lookup.
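The inverted index described above can be sketched as a mapping from each topic to a list of users ordered by interest weight. The profile layout and the function name `build_inverted_index` are assumptions; in practice the structure would be maintained incrementally in a data store rather than rebuilt in memory.

```python
from collections import defaultdict


def build_inverted_index(profiles):
    """Invert user -> {topic: weight} into topic -> [(user, weight)], heaviest first."""
    index = defaultdict(list)
    for user_id, interests in profiles.items():
        for topic, weight in interests.items():
            index[topic].append((user_id, weight))
    for users in index.values():
        users.sort(key=lambda pair: pair[1], reverse=True)
    return dict(index)


# Hypothetical user profiles with per-topic interest weights.
profiles = {
    "user_a": {"sport": 0.9, "music": 0.2},
    "user_b": {"sport": 0.4},
}
index = build_inverted_index(profiles)
```

When a new “sport” topic arrives, `index["sport"]` gives the users to notify without scanning every profile.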

Social Graph

In addition to the interests explicitly defined for a user, data relating to the user's social graph of ‘friends’ or ‘connections’ is also processed when available. While this data is not the primary indicator of a user's interests, it does form a lower weighted component to be used in determining a user's potential interests. In simple terms, if your friends all like rock music, the System 100 extrapolates that you may like rock music too.

Used exclusively, a user's friends' interests and broader demographic data can be used to generate a subset of ‘recommended’ topic items that fall outside a user's core interest graph. By presenting these items and monitoring the user's explicit and implicit feedback to them (views, ratings, etc.), the System 100 can expand the user's set of interests and broaden the media they are exposed to.
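The lower weighting of social graph data might be sketched as follows. The constant `FRIEND_FACTOR`, its value of 0.25, and the averaging scheme are assumptions chosen for illustration; the description above states only that friends' interests form a lower weighted component.

```python
FRIEND_FACTOR = 0.25  # assumed down-weighting applied to friends' interests


def potential_interests(own, friends):
    """Merge a user's explicit interests with down-weighted friend interests.

    own:     {topic: weight} explicitly defined for the user
    friends: list of {topic: weight} dicts, one per friend or connection
    """
    scores = dict(own)
    if not friends:
        return scores
    totals = {}
    for friend in friends:
        for topic, weight in friend.items():
            totals[topic] = totals.get(topic, 0.0) + weight
    for topic, total in totals.items():
        friend_score = FRIEND_FACTOR * total / len(friends)
        # Explicit interests remain the primary indicator.
        scores[topic] = max(scores.get(topic, 0.0), friend_score)
    return scores


def recommended_topics(own, friends):
    """Topics that come only from the social graph, outside the core interests."""
    merged = potential_interests(own, friends)
    return {topic: w for topic, w in merged.items() if topic not in own}
```

For a user whose only explicit interest is sport and whose friends all like rock music, `recommended_topics` would return rock music with a reduced weight, matching the extrapolation described above.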

Custom Queue Delivery

The taste graph is paired with the index to create a list of relevant results. These results may be from the index database as a whole, or returned according to context (relevant in news, relevant in sport, relevant in music, relevant in a location, etc.).

Results are returned by the Search platform 120 in a result set as described in the Search Box 1006.

Infrastructure

When a user first defines their interests, a large quantity of information is processed to quickly return a relevant list of results. Because of this, the infrastructure is optimized to efficiently deliver results using ‘big data’ database and server tuning techniques. Rather than implementing any platform logic, this sectioning is more a matter of server configuration, and is largely invisible to the platform processing.

Disk Segmentation:

At the lowest level, disk usage can be segmented in a Redundant Array of Independent Disks (RAID) configuration to increase performance.

Server Segmentation:

The data may also be split across servers to increase performance and redundancy. For the search index, ElasticSearch can be configured in a multiple server configuration, so that queries are processed across a number of servers in parallel.

For the database, the vast majority of database solutions provide support for server clusters and data replication for high volume, high performance data processing. The nature of the data means that it is best suited to a non-relational database structure, which allows for less complex replication processes (compared to RDBMS solutions). As an example, MongoDB provides relatively easily configured replication and storage capabilities suitable for the platform service, but many alternative options are available that may be employed (Cassandra, HBase, etc.).

Faceted Search

In order to provide context relevant results, the System 100 uses a search index which provides support for ‘faceted search’. This means that the search keyword(s) can be supplemented with additional data to filter results.

There is a connection between this and the ‘context’ of topics described earlier. For example, a standard search for ‘Beatles’ may return results for both the band and the animal, in proportions that depend on which has more references in the index. In a facet assisted search, with ‘music’ as the ‘facet’ (context), only results for the band will be returned.
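The effect of a facet on a keyword search can be illustrated with a small in-memory sketch. This does not use the API of Lucene or any of its derivatives; the `faceted_search` function, the document fields `text` and `context`, and the sample documents are hypothetical.

```python
def faceted_search(documents, keyword, facet=None):
    """Return documents matching a keyword, optionally limited to one context."""
    keyword = keyword.lower()
    hits = [doc for doc in documents if keyword in doc["text"].lower()]
    if facet is not None:
        # The facet narrows keyword hits to a single context.
        hits = [doc for doc in hits if doc["context"] == facet]
    return hits


# Hypothetical index contents.
documents = [
    {"id": 1, "text": "Beatles album reissued", "context": "music"},
    {"id": 2, "text": "Beatles spotted in the garden", "context": "nature"},
    {"id": 3, "text": "Rain expected tomorrow", "context": "weather"},
]

all_hits = faceted_search(documents, "beatles")
music_hits = faceted_search(documents, "beatles", facet="music")
```

Here `all_hits` contains both the band and the animal, while `music_hits` contains only the band.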

As described in Topic Extraction 212, additional information provided from various dictionary lookups is added to the index with each topic brought in from the Capture Platform 110. Additionally, taste graph data (the query) is given context in the same way. This provides the index and query facets with additional information for search lookups.

Many search index technologies provide support for faceted search, such as Lucene and many of its derivatives (ElasticSearch, Solr, etc.).

Caching

Caching occurs on several levels, thus exploiting the relatively cheap cost of storage over the less efficient use of processing power.

API outputs provide cached output data as appropriate, following standard web caching methodologies (e.g. content expiry dates, ETags, server-side rendered HTML output caching, database lookup caches, etc.).

Both the index and database contain internal transparent memory management & caching to optimize disk IO (in the case of Solr/ElasticSearch/MongoDB and many alternatives).

By extracting the topic and performing processing before adding data to the index, the system enables the Search platform to provide more logic driven caching options by creating separated indexes specific to search query types. For example, by knowing the source of content and context of queries, the System 100 can apply different caching rules accordingly: a search for a personality would likely require an up-to-date, immediate result from un-cached indexes of news content, whereas a search for “Polar Bears” could draw from a cached result.

Custom indexes are also created aggressively, because:

-   The data, because it is simply text data, is relatively small in terms of storage.
-   The processing cost of working across a large index is relatively high.
-   The content is largely static, not frequently updated, and therefore doesn't need constant processing power.

The indexes are created to satisfy requests by:

The User:

As topics are extracted, they are queried against an inverted index of user interests. Topics are then sent to individual user queues to be drawn upon for relevant results the next time a query is made.

To prevent this preprocessing task from growing indefinitely, lists are only maintained for users who have connected to the system within the past 30 days. With this segmentation, an active user's search is passed through their personalized index and filtered according to query terms and facets.
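A minimal sketch of this per-user queueing, limited to users seen within the past 30 days, is shown below. The names `ACTIVE_WINDOW`, `active_users`, and `enqueue_topic`, and the in-memory dictionaries standing in for the inverted index and the queues, are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

ACTIVE_WINDOW = timedelta(days=30)  # the 30-day activity window from the text


def active_users(last_seen, now):
    """Users who have connected within the activity window."""
    return {user for user, seen in last_seen.items() if now - seen <= ACTIVE_WINDOW}


def enqueue_topic(topic, index, queues, last_seen, now):
    """Push an extracted topic onto the queue of each interested, active user.

    index:  inverted index, {topic: [(user_id, weight), ...]}
    queues: {user_id: [(topic, weight), ...]}, modified in place
    """
    active = active_users(last_seen, now)
    for user_id, weight in index.get(topic, []):
        if user_id in active:
            queues.setdefault(user_id, []).append((topic, weight))


# Hypothetical data: bob has not connected for 45 days, so no queue is kept.
now = datetime(2011, 2, 1, tzinfo=timezone.utc)
last_seen = {"alice": now - timedelta(days=3), "bob": now - timedelta(days=45)}
index = {"sport": [("bob", 0.8), ("alice", 0.4)]}
queues = {}
enqueue_topic("sport", index, queues, last_seen, now)
```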

Context Segmentation:

Faceted searches provide some efficiency within the index for context-based queries; these can be divided at processing time to provide a single endpoint to client requests, for example to return only results relevant to sports, business news, music, etc.

These examples describe isolated indexes that duplicate data also stored in the core index. The core index is a single index of all topics that is queried when a suitable smaller pre-cached index is not available. In essence, the core index is a master index of all information stored in the system, while isolated indexes are smaller, more specific indexes based on trending topics.

Related Content:

When a user is reviewing a video, they may wish to view ‘related items’ for that clip. In this case, the topic and context of each video can be used in place of the user's taste graph to retrieve and deliver related results.

Recommendations:

In effect, the custom queue itself can be said to be a list of recommendations based on the user's interests. This concept is extended to include the interests of their friends, which may not overlap with the user's core interests, but are assumed to be loosely relevant.

Queue Ranking:

The queue lets client implementations choose among several options for ordering a result list:

-   Time: Prioritize recent results (strictly or loosely)
-   Relevancy: Prioritize the most relevant results
-   Context Spread: With each result set, include a mixture of contexts, for example: some news, entertainment, sport, etc.

Relevance is determined by topic and interest confidence matching and is driven by the native relevancy mechanisms of Lucene or an alternative search engine.
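The three ordering options might be sketched as follows. The result fields `time`, `score`, and `context` are hypothetical, the `score` stands in for whatever relevancy value the search engine supplies, and the round-robin interleaving used for the context spread is one possible interpretation rather than the only one.

```python
from itertools import zip_longest


def rank_by_time(results):
    """Most recent first; ISO-8601 UTC strings sort chronologically."""
    return sorted(results, key=lambda r: r["time"], reverse=True)


def rank_by_relevancy(results):
    """Highest relevancy score first."""
    return sorted(results, key=lambda r: r["score"], reverse=True)


def rank_by_context_spread(results):
    """Interleave contexts so each part of the list mixes news, sport, etc."""
    by_context = {}
    for result in rank_by_relevancy(results):
        by_context.setdefault(result["context"], []).append(result)
    mixed = []
    # Take one result per context in turn until every group is exhausted.
    for group in zip_longest(*by_context.values()):
        mixed.extend(r for r in group if r is not None)
    return mixed


# Hypothetical result set.
results = [
    {"id": "a", "context": "news", "score": 0.9, "time": "2011-01-01T17:00:00Z"},
    {"id": "b", "context": "news", "score": 0.8, "time": "2011-01-01T18:00:00Z"},
    {"id": "c", "context": "sport", "score": 0.5, "time": "2011-01-01T16:00:00Z"},
]
```

A ‘loose’ time ordering could be obtained by blending the time and score keys, which this sketch leaves out.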

Video Segmentation

In general, physical segmentation of video clips is avoided, to allow modification of the clip timings at a later date: segmentation is logical and does not produce physical file changes. Topic start and end times are stored and delivered to the client with the topic data, which then sends them to the Media Storage and Delivery Platform 130 where long form files are read and the appropriate wrapper placed around the Text Topic Data and Video Data.

The separation between the Media Storage and Delivery Platform 130 and Search platform 120 is apparent here.

Incoming video is delivered to the Media Storage and Delivery Platform 130 in continuous chunks, such as 24 one-hour videos each day. These videos are stored in a way that enables lookup based on the content source and timing information.

At the same time, the Search platform 120 extracts Topic IDs and stores them in the Index of TV 215 along with their timing information. This timing information is encoded in UTC, so regional time zone variations do not need to be considered. Topics returned to the client, such as search results or matches to a user's taste graph, contain start and end time information for the topic, which is passed to the Media Storage and Delivery Platform 130 when the user requests playback.

Additionally, the channel name should be considered as a token, so where the channel name might be ‘CNN’ in practice, the actual disk structure may follow a wider pattern to differentiate between ‘CNN-International’ and ‘CNN-HD’ and/or provide character formatting to compensate for file system restrictions on non-ASCII characters.

For example, if the client needs to access a video item, such as a specific interview with Barack Obama, the Media Storage and Delivery Platform 130 is delivered the topic source, the topic start time, and the topic end time:

{ location: ‘USA’, channel: ‘CNN’, start: ‘2011-01-01 17:12:22’, end: ‘2011-01-01 17:13:54’ }

Using this information, the Media Storage and Delivery Platform 130 understands that the required video file is at the location:

-   -   /uk/cnn/2011-01-01 17:00:00

The platform also understands to start playback 12 minutes and 22 seconds into the video block and to end after playing 1 minute and 32 seconds of video. This is a simple example of the process. In practice, the actual processing is slightly more complex, as multiple video files may be involved (if a request spans more than one file), in which case the streaming server compensates by combining two files into an output.
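The lookup of the hourly video block, the playback offset, and the playback duration can be sketched as follows, including the case where a request spans more than one file. The function `resolve_segment` and its return structure are hypothetical, and the location and channel are assumed to be passed in as already-normalized path tokens, since the token mapping itself is not specified here.

```python
from datetime import datetime, timedelta

BLOCK = timedelta(hours=1)      # one-hour video blocks, as in the example
FMT = "%Y-%m-%d %H:%M:%S"       # timestamps are assumed to be UTC


def resolve_segment(location, channel, start, end):
    """Map a topic's start/end times onto hourly block files.

    Returns one entry per block touched, each with the block path, the
    offset into that block in seconds, and the seconds to play from it.
    """
    start_dt = datetime.strptime(start, FMT)
    end_dt = datetime.strptime(end, FMT)
    if end_dt <= start_dt:
        raise ValueError("end must be after start")
    block_start = start_dt.replace(minute=0, second=0, microsecond=0)
    parts = []
    while block_start < end_dt:
        block_end = block_start + BLOCK
        seg_start = max(start_dt, block_start)
        seg_end = min(end_dt, block_end)
        parts.append({
            "path": f"/{location}/{channel}/{block_start.strftime(FMT)}",
            "offset_seconds": int((seg_start - block_start).total_seconds()),
            "duration_seconds": int((seg_end - seg_start).total_seconds()),
        })
        block_start = block_end
    return parts


parts = resolve_segment("uk", "cnn", "2011-01-01 17:12:22", "2011-01-01 17:13:54")
```

For the times in the example above, `parts` holds a single entry for the 17:00:00 block with an offset of 742 seconds (12 minutes and 22 seconds) and a duration of 92 seconds (1 minute and 32 seconds). A request crossing the top of the hour yields two entries, which a streaming server could then combine into one output.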

There is another exception to this process. Where the delivery server cannot provide dynamic streaming (for example, where Wowza Media Server delivers H.263 encoded 3gp videos), the Search platform 120 can queue a Delivery Platform API request to pre-encode content to the required format, so that the Media Storage and Delivery Platform 130 will deliver a specific pre-encoded video file (continuing the example above, “/uk/cnn/2011-01-01 17:12:33.3gp”).

GLOSSARY OF TERMS

API (Application Programming Interface): An API is a source code-based specification intended to be used as an interface by software components to communicate with each other. An API may include specifications for routines, data structures, object classes, and variables.

CDN (Content Delivery Network): A Content Delivery Network is a system of computers containing copies of data placed at various nodes of a network. When properly designed and implemented, a CDN can improve access to the data it caches by increasing access bandwidth and redundancy, and reducing access latency.

Redundancy: Redundancy within a computer network means that multiple versions of a single piece of data exist in multiple places across a network. This is useful because it means that a program searching for this information is more likely to find it, needs less bandwidth to continue its search, and, in the case of damage to a physical server, the data isn't truly gone because other copies of that data exist elsewhere.

Client: Client, at least in the context of this document, is meant to indicate a program that interacts with the main Real-time Delivery of Segmented Video system, but is not a part of it. A client can be anything from a mobile phone app to a web-based user interface. For the most part, clients are used by users to access the database and retrieve data.

Client Devices: A Client Device is any device that runs a client program, such as an Apple iPhone, an Android-capable phone, or a TV with IPTV capabilities.

Cloud: Cloud infrastructure or simply “the cloud” is a system of data organization in which pieces of data are scattered across a network of physical servers. These servers can be located almost anywhere physically, but are all linked by a common cloud network. Cloud infrastructure has many benefits, including a massive capability for redundancy, a capability to store and efficiently use local and regional data, and a network that will lose little data in the case that a physical server is damaged.

DVB (Digital Video Broadcasting): DVB is a suite of internationally accepted open standards for digital television. DVB standards are maintained by the DVB Project, an international industry consortium with more than 270 members, and they are published by a Joint Technical Committee (JTC) of European Telecommunications Standards Institute (ETSI), European Committee for Electrotechnical Standardization (CENELEC) and European Broadcasting Union (EBU).

EPG (Electronic Programming Guide): EPG provides users of television, radio, and other media applications with continuously updated menus displaying broadcast programming or scheduling information for current and upcoming programming.

Function: Function, at least in regards to the context of this document, is used to describe any task that a program or a component of a program is designed to do. For example, “The Capture Platform 110 provides a number of functions:” simply means that the Capture Platform 110 has the capability of performing a number of tasks.

IPTV (Internet Protocol Television): IPTV is a system in which television services are delivered using the Internet or a similar wide-scale network, instead of using traditional terrestrial, satellite signal, and cable television formats.

JSON (JavaScript Object Notation): JSON is a lightweight text-based open standard designed for human-readable data interchange.

Line 21: Line 21 (or EIA-608) is the standard for closed captioning in the United States and Canada. It also defines Extended Data Service, a means for including information, such as program name, in a television transmission.

Long-Form Video: Long-Form video, at least within the context of this document, simply refers to video data before it has been processed. The actual length of the video may vary, but in most cases it can be assumed to be about the length of a television show or movie.

Media RSS: RSS, originally called RDF Site Summary, is a family of web feed formats used to publish frequently updated works. Media RSS simply refers to an RSS feed that is used for media.

OCR: Optical character recognition, or OCR, is the mechanical or electronic translation of scanned images of handwritten, typewritten or printed text into machine-encoded text. This conversion is used by the System 100 to translate closed-captioned text into a form that the Search Platform 120 is capable of reading.

RAID (Redundant Array of Independent Disks): RAID is a storage technology that combines multiple physical storage servers so that they function as a single unit. This single unit, known as a logical unit, doesn't require that the servers be physically close, only that they are linked by a network. Data is distributed across the drives in one of several ways called “RAID levels,” depending on what level of redundancy and performance (via parallel communication) is required.

Relational Database Management System (RDBMS): RDBMS is a Database Management System in which data is stored in tables and the relationships between the data are also stored in tables. The data can be accessed or reassembled in many different ways without requiring that the tables be changed.

Representational State Transfer (REST): REST is a form of software architecture for distributed hypermedia systems such as the World Wide Web. REST style architectures consist of clients and servers. Clients send requests to servers; servers process requests and return appropriate responses.

Social Graph: A social graph is a collection of data points that represent a person's interests and how those interests interact. Social graphs can be expanded to include information about a group of people or about a group of interests shared by multiple people.

Topic: A topic, according to this system, is a basic description of a chunk of video. The topic can be broad, such as “Sports” or “News,” or specific, such as “Lady Gaga” or “Bill Gates.” A chunk of video can have as many topics as are required to describe it. These topics are what the Search platform 120 looks for when it attempts to find videos relevant to a search query.

User: A user is anyone using the System 100 or one of its clients.

SUMMARY

While various embodiments have been described above, it should be understood that they have been presented by way of example only, and not limitation. Thus, the breadth and scope of a preferred embodiment should not be limited by any of the above described exemplary embodiments, but should be defined only in accordance with the claims and their equivalents for any patent that issues claiming priority from the present provisional patent application.

For example, as referred to herein, a machine or engine may be a virtual machine, computer, node, instance, host, or machine in a networked computing environment. Also as referred to herein, a networked computing environment is a collection of machines connected by communication channels that facilitate communications between machines and allow for machines to share resources. Network may also refer to a communication medium between processes on the same machine. Also as referred to herein, a server is a machine deployed to execute a program operating as a socket listener and may include software instances.

In all descriptions of “servers” or other computing devices herein, whether or not the illustrations of those servers or other computing devices similarly show a server-like illustration in the figures, it should be understood that any such described servers or computing devices will similarly perform their described functions in accordance with computer-readable instructions stored on computer-readable media that are connected thereto.

Resources may encompass any types of resources for running instances including hardware (such as servers, clients, mainframe computers, networks, network storage, data sources, memory, central processing unit time, scientific instruments, and other computing devices), as well as software, software licenses, available network services, and other non-hardware resources, or a combination thereof.

A networked computing environment may include, but is not limited to, computing grid systems, distributed computing environments, cloud computing environments, etc. Such networked computing environments include hardware and software infrastructures configured to form a virtual organization comprised of multiple resources which may be in geographically dispersed locations.

Various terms used herein have special meanings within the present technical field. Whether a particular term should be construed as such a “term of art,” depends on the context in which that term is used. “Connected to,” “in communication with,” or other similar terms should generally be construed broadly to include situations both where communications and connections are direct between referenced elements or through one or more intermediaries between the referenced elements, including through the Internet or some other communicating network. “Network,” “system,” “environment,” and other similar terms generally refer to networked computing systems that embody one or more aspects of the present disclosure. These and other terms are to be construed in light of the context in which they are used in the present disclosure and as those terms would be understood by one of ordinary skill in the art in the disclosed context. The above definitions are not exclusive of other meanings that might be imparted to those terms based on the disclosed context.

Words of comparison, measurement, and timing such as “at the time,” “equivalent,” “during,” “complete,” and the like should be understood to mean “substantially at the time,” “substantially equivalent,” “substantially during,” “substantially complete,” etc., where “substantially” means that such comparisons, measurements, and timings are practicable to accomplish the implicitly or expressly stated desired result.

Additionally, the section headings herein are provided for consistency with the suggestions under 37 CFR 1.77 or otherwise to provide organizational cues. These headings shall not limit or characterize the invention(s) set out in any claims that may issue from this disclosure. Specifically and by way of example, although the headings refer to a “Technical Field,” such claims should not be limited by the language chosen under this heading to describe the so-called technical field. Further, a description of a technology in the “Background” is not to be construed as an admission that technology is prior art to any invention(s) in this disclosure. Neither is the “Brief Summary” to be considered as a characterization of the invention(s) set forth in issued claims. Furthermore, any reference in this disclosure to “invention” in the singular should not be used to argue that there is only a single point of novelty in this disclosure. Multiple inventions may be set forth according to the limitations of the multiple claims issuing from this disclosure, and such claims accordingly define the invention(s), and their equivalents, that are protected thereby. In all instances, the scope of such claims shall be considered on their own merits in light of this disclosure, but should not be constrained by the headings set forth herein.

What is claimed is:
 1. A video capture system operable to capture live video content and to provide a personalized channel of recorded video content to a user based on the user's social interests and connections, the system comprising: a) a video capture server connected to a video source, the video capture server operable to capture video signals from the video source; b) an initial processing server in communication with the video capture server, the initial processing server operable to receive the captured video signals from the video capture server and process the captured video signals to provide at least video files for storage and text files associated with the captured video signals; and c) a topic extraction server in communication with the initial processing server and operable to receive the associated text files and perform contextual topic extracting and processing to provide additional searchable contextual information and to store the searchable contextual information and place this information, along with the text files in a searchable database archive, the stored information being associated with the stored video files, wherein the initial processing server is further operable to process secondary data associated with the captured video signals to further supplement the text files associated with the captured video signals, wherein the secondary data comprises at least one of Nielsen data, other TV audience data, and/or EPG data and enables providing of additional searchable contextual information.
 2. The video capture system of claim 1, wherein the contextual topic extracting involves determining a change in context from a first portion of the captured video signals to a second portion of the captured video signals.
 3. The video capture system of claim 1, wherein the video source is at least one of an over-the-air, Internet Protocol TV, or cable broadcast source.
 4. The video capture system of claim 1, wherein the topic extraction server is further in communication with a contextual database whereby the topic extraction server is able to perform the contextual topic extraction.
 5. The video capture system of claim 1, further comprising a metrics and trends processing machine which is operable to process the searchable contextual information from the topic extraction server in order to identify metrics and trends in topics being discussed on television, including using the contextual information applied from examination of a contextual database in communication with the topic extraction server.
 6. The video capture system of claim 1, wherein the initial processing server is further operable to extract text from the captured video signals using at least one of optical text recognition of elements of the captured video signals and voice recognition of a sound portion of the captured video signals.
 7. The video capture system of claim 1, wherein the topic extraction server is operable, at least in part through use of the secondary data, to detect topics that are trending on the video source.
 8. The video capture system of claim 1, further comprising a media storage and delivery platform in communication with the initial processing server, the media storage and delivery platform operable to store the video files and associated text files provided by the initial processing server, wherein the media storage and delivery platform is operable to segment video files according to user feedback.
 9. A video capture system, operable to capture live video content and to provide a personalized channel of recorded video content to a user based on the user's social interests and connections, the system comprising: a) a video capture server connected to a video source, the video capture server operable to capture video signals from the video source; b) an initial processing server in communication with the video capture server, the initial processing server operable to receive the captured video signals from the video capture server and process the captured video signals to provide at least video files for storage and text files associated with the captured video signals; and c) a topic extraction server in communication with a contextual database, wherein the topic extraction server is able to perform contextual topic extraction, wherein the initial processing server is further operable to process secondary data associated with the captured video signals to further supplement the text files associated with the captured video signals, wherein the secondary data comprises at least one of Nielsen data, other TV audience data, and/or EPG data and enables providing of additional searchable contextual information.
 10. A method for capturing live video content, the method comprising: capturing video signals in a video capture server connected to a video source; receiving the captured video signals from the video capture server in an initial processing server; processing the captured video signals in the initial processing server to provide at least video files for storage and text files associated with the captured video signals; receiving the associated text files in a topic extraction server in communication with the initial processing server and performing contextual topic extracting and processing to provide additional searchable contextual information; storing the searchable contextual information and placing this information, along with the original text files in a searchable database archive, the stored information being associated with the stored video files; and processing secondary data associated with the captured video signals to further supplement the text files associated with the captured video signals, wherein the secondary data comprises at least one of Nielsen data, other TV audience data, and/or EPG data and enables providing of additional searchable contextual information.